Jailbreak Attack Initializations as Extractors of Compliance Directions

Amit LeVi; Rom Himelstein; Yaniv Nemcovsky; Avi Mendelson; Chaim Baskin

doi:10.18653/v1/2025.findings-emnlp.354

Abstract

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model’s activation space. Recent studies have shown that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks rely on arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack and subsequent initialization gradually converges to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs.

Anthology ID: 2025.findings-emnlp.354
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6672–6705
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.354/
DOI: 10.18653/v1/2025.findings-emnlp.354
Cite (ACL): Amit LeVi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, and Chaim Baskin. 2025. Jailbreak Attack Initializations as Extractors of Compliance Directions. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6672–6705, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Jailbreak Attack Initializations as Extractors of Compliance Directions (LeVi et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.354.pdf
Checklist: 2025.findings-emnlp.354.checklist.pdf
