Jiecong Wang
2026
Activation-Guided Local Editing for Jailbreaking Attacks
Jiecong Wang | Haoran Li | Hao Peng | Ziqian Zeng | Zihao Wang | Haohua Du | Zhengtao Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jiecong Wang | Haoran Li | Hao Peng | Ziqian Zeng | Zihao Wang | Haohua Du | Zhengtao Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Large Language Models (LLMs) become indispensable assistants, they remain vulnerable to misuse. Jailbreaking is an essential adversarial technique for red-teaming models to uncover and patch security flaws. However, existing jailbreak methods suffer from significant limitations. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose AGILE, a concise and effective two-stage framework that combines the advantages of these approaches. The first stage performs a one-shot, scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent. The second stage utilizes information from the model’s hidden states to guide fine-grained edits, effectively steering the model’s internal representation of the input from a malicious one toward a benign one. Extensive experiments demonstrate that AGILE achieves state-of-the-art Attack Success Rate, with gains of up to 37.74% over the strongest baseline, and AGILE exhibits excellent transferability to black-box and large-scale models. Our code is available at https://github.com/SELGroup/AGILE.