Rom Himelstein




2025

Jailbreak Attack Initializations as Extractors of Compliance Directions
Amit LeVi | Rom Himelstein | Yaniv Nemcovsky | Avi Mendelson | Chaim Baskin
Findings of the Association for Computational Linguistics: EMNLP 2025

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to a distinct direction in the model’s activation space. Recent studies have shown that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and existing attacks rely on arbitrary or hand-picked initializations. This work shows that gradient-based jailbreak attacks and their initializations gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that projects unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs.
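
The abstract describes compliance and refusal as directions in activation space and CRI as a way to start attacks further along the compliance direction. The sketch below is not the paper's implementation; it only illustrates one plausible reading: estimating a single compliance direction as the difference of mean hidden activations between compliance-eliciting and refusal-eliciting prompts, and ranking candidate initializations by their projection onto it. The use of PyTorch, the difference-of-means estimate, and the helper names are assumptions made for illustration.

```python
# Minimal sketch, assuming last-layer hidden activations have already been
# collected for two prompt sets (compliance-eliciting vs. refusal-eliciting).
# This is an illustrative reading of "compliance direction", not the CRI code.
import torch


def compliance_direction(compliant_acts: torch.Tensor,
                         refusal_acts: torch.Tensor) -> torch.Tensor:
    """Unit vector pointing from the refusal cluster toward the compliance cluster.

    compliant_acts, refusal_acts: (N, d) stacks of hidden activations.
    """
    direction = compliant_acts.mean(dim=0) - refusal_acts.mean(dim=0)
    return direction / direction.norm()


def projection_score(activation: torch.Tensor, direction: torch.Tensor) -> float:
    """Scalar projection of one prompt's activation onto the compliance direction.

    Larger values mean the prompt already sits further along that direction.
    """
    return float(torch.dot(activation, direction))


def rank_initializations(candidate_acts: torch.Tensor,
                         direction: torch.Tensor) -> torch.Tensor:
    """Indices of candidate initializations, largest projection first."""
    scores = candidate_acts @ direction
    return torch.argsort(scores, descending=True)
```

Under this reading, an attack would pick the highest-ranked candidate as its starting point, so the optimization begins closer to the compliance region instead of from an arbitrary or hand-picked string.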