Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang; Haoran Li; Hao Peng; Ziqian Zeng; Zihao Wang; Haohua Du; Zhengtao Yu (余正涛)

Abstract

As Large Language Models (LLMs) become indispensable assistants, they remain vulnerable to misuse. Jailbreaking is an essential adversarial technique for red-teaming models to uncover and patch security flaws. However, existing jailbreak methods suffer from significant limitations. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose AGILE, a concise and effective two-stage framework that combines the advantages of these approaches. The first stage performs a one-shot, scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent. The second stage uses information from the model’s hidden states to guide fine-grained edits, steering the model’s internal representation of the input from a malicious one toward a benign one. Extensive experiments demonstrate that AGILE achieves a state-of-the-art Attack Success Rate, with gains of up to 37.74% over the strongest baseline, and exhibits excellent transferability to black-box and large-scale models. Our code is available at https://github.com/SELGroup/AGILE.

Anthology ID: 2026.acl-long.801
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 17614–17633
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.801/
Cite (ACL): Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, and Zhengtao Yu. 2026. Activation-Guided Local Editing for Jailbreaking Attacks. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17614–17633, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Activation-Guided Local Editing for Jailbreaking Attacks (Wang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.801.pdf
Checklist: 2026.acl-long.801.checklist.pdf
