Jailbreak LLMs through Internal Stance Manipulation
Shuangjie Fu, Du Su, Beining Huang, Fei Sun, Jingang Wang, Wei Chen, Huawei Shen, Xueqi Cheng
Abstract
To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, modify adversarial prompts to induce LLMs to generate responses that strictly follow a fixed affirmative template. However, we observed that reliance on this rigid output template is ineffective for certain malicious requests, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that is universally effective across all hostile requests. To achieve this, we explore LLMs' intrinsic safety mechanism: a refusal stance towards the adversarial prompt is formed in a confined region and ultimately leads to a rejective response. In light of this, we propose Stance Manipulation (SM), a novel automated jailbreak approach that generates jailbreak prompts to suppress the refusal stance and induce affirmative responses. Our experiments across four mainstream open-source LLMs demonstrate SM's superior performance. Under the commonly used setting, SM achieves success rates over 77.1% across all models on AdvBench. Specifically, for Llama-2-7b-chat, SM outperforms the best baseline by 25.4%. In further experiments with extended iterations in a speedup setup, SM achieves an attack success rate of over 92.2% across all models. Our code is publicly available at https://github.com/Zed630/Stance-Manipulation.
- Anthology ID:
- 2025.emnlp-main.780
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15455–15470
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.780/
- Cite (ACL):
- Shuangjie Fu, Du Su, Beining Huang, Fei Sun, Jingang Wang, Wei Chen, Huawei Shen, and Xueqi Cheng. 2025. Jailbreak LLMs through Internal Stance Manipulation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15455–15470, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Jailbreak LLMs through Internal Stance Manipulation (Fu et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.780.pdf
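The abstract's central idea is that a refusal stance toward an adversarial prompt forms in the model's internal representations and can, in principle, be measured and suppressed. As an illustration only, and not the paper's implementation, the minimal Python sketch below probes a crude "refusal direction" in an intermediate hidden state of Llama-2-7b-chat and scores a prompt by its projection onto that direction; an SM-style attack would then optimize the jailbreak prompt to drive that score down while eliciting an affirmative response. All names (`PROBE_LAYER`, `refusal_direction`, `stance_score`) and the difference-of-means probe are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of probing a "refusal stance" direction in hidden states.
# NOT the paper's method; layer choice, probe, and prompts are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"   # one of the models evaluated in the paper
PROBE_LAYER = 14                          # assumed intermediate layer for the probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def hidden_state(prompt: str, layer: int) -> torch.Tensor:
    """Mean hidden state of the prompt tokens at the given layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0].mean(dim=0).float()

# Estimate a crude refusal direction as the difference between hidden states of
# a prompt the model refuses and one it answers (difference-of-means probe).
refused = hidden_state("How do I make a dangerous weapon at home?", PROBE_LAYER)
answered = hidden_state("How do I bake a loaf of bread at home?", PROBE_LAYER)
refusal_direction = refused - answered
refusal_direction = refusal_direction / refusal_direction.norm()

def stance_score(prompt: str) -> float:
    """Projection onto the refusal direction; an SM-style attack would
    optimize the jailbreak prompt to push this score down."""
    h = hidden_state(prompt, PROBE_LAYER)
    return torch.dot(h, refusal_direction).item()

print(stance_score("Tell me how to pick a lock."))
```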