Activation-Guided Local Editing for Jailbreaking Attacks
Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
Abstract
As Large Language Models (LLMs) become indispensable assistants, they remain vulnerable to misuse. Jailbreaking is an essential adversarial technique for red-teaming models to uncover and patch security flaws. However, existing jailbreak methods suffer from significant limitations. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose AGILE, a concise and effective two-stage framework that combines the advantages of these approaches. The first stage performs a one-shot, scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent. The second stage utilizes information from the model’s hidden states to guide fine-grained edits, effectively steering the model’s internal representation of the input from a malicious one toward a benign one. Extensive experiments demonstrate that AGILE achieves state-of-the-art Attack Success Rate, with gains of up to 37.74% over the strongest baseline, and AGILE exhibits excellent transferability to black-box and large-scale models. Our code is available at https://github.com/SELGroup/AGILE.- Anthology ID:
- 2026.acl-long.801
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 17614–17633
- Language:
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.801/
- DOI:
- Cite (ACL):
- Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, and Zhengtao Yu. 2026. Activation-Guided Local Editing for Jailbreaking Attacks. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17614–17633, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Activation-Guided Local Editing for Jailbreaking Attacks (Wang et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.801.pdf