Jailbreak LLMs through Internal Stance Manipulation
Shuangjie Fu, Du Su, Beining Huang, Fei Sun, Jingang Wang, Wei Chen, Huawei Shen, Xueqi Cheng
Abstract
To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, modify adversarial prompts to induce LLMs to generate responses that strictly follow a fixed affirmative template. However, we observed that reliance on this rigid output template is ineffective for certain malicious requests, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that is universally effective across all hostile requests. To achieve this, we explore LLMs' intrinsic safety mechanism: a refusal stance towards the adversarial prompt is formed in a confined region and ultimately leads to a rejective response. In light of this, we propose Stance Manipulation (SM), a novel automated jailbreak approach that generates jailbreak prompts to suppress the refusal stance and induce affirmative responses. Our experiments across four mainstream open-source LLMs demonstrate SM's superior performance. Under the commonly used setting, SM achieves success rates over 77.1% across all models on AdvBench. Specifically, for Llama-2-7b-chat, SM outperforms the best baseline by 25.4%. In further experiments with extended iterations in a speedup setup, SM achieves an attack success rate of over 92.2% across all models. Our code is publicly available at https://github.com/Zed630/Stance-Manipulation.
- Anthology ID:
- 2025.emnlp-main.780
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15455–15470
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.780/
- Cite (ACL):
- Shuangjie Fu, Du Su, Beining Huang, Fei Sun, Jingang Wang, Wei Chen, Huawei Shen, and Xueqi Cheng. 2025. Jailbreak LLMs through Internal Stance Manipulation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15455–15470, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Jailbreak LLMs through Internal Stance Manipulation (Fu et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.780.pdf
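The abstract's central idea is that a refusal stance toward an adversarial prompt forms in the model's internal representations and can, in principle, be measured and suppressed. As an illustration only, and not the paper's implementation, the minimal Python sketch below probes a crude "refusal direction" in an intermediate hidden state of Llama-2-7b-chat and scores a prompt by its projection onto that direction; an SM-style attack would then optimize the jailbreak prompt to drive that score down while eliciting an affirmative response. All names (`PROBE_LAYER`, `refusal_direction`, `stance_score`) and the difference-of-means probe are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of probing a "refusal stance" direction in hidden states.
# NOT the paper's method; layer choice, probe, and prompts are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"   # one of the models evaluated in the paper
PROBE_LAYER = 14                          # assumed intermediate layer for the probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def hidden_state(prompt: str, layer: int) -> torch.Tensor:
    """Mean hidden state of the prompt tokens at the given layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0].mean(dim=0).float()

# Estimate a crude refusal direction as the difference between hidden states of
# a prompt the model refuses and one it answers (difference-of-means probe).
refused = hidden_state("How do I make a dangerous weapon at home?", PROBE_LAYER)
answered = hidden_state("How do I bake a loaf of bread at home?", PROBE_LAYER)
refusal_direction = refused - answered
refusal_direction = refusal_direction / refusal_direction.norm()

def stance_score(prompt: str) -> float:
    """Projection onto the refusal direction; an SM-style attack would
    optimize the jailbreak prompt to push this score down."""
    h = hidden_state(prompt, PROBE_LAYER)
    return torch.dot(h, refusal_direction).item()

print(stance_score("Tell me how to pick a lock."))
```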