Rejection-to-Acceptance Transition: Model Editing-Based Jailbreak Backdoor Injection Not Limited to Few Output Tokens

Shiji Yang, Min Cai, Hao Xiong, Congyao Mei, Haodong Zou, Shicheng Tan, Jie Chen, Fulan Qian, Shu Zhao


Abstract
Model editing-based jailbreak backdoor attacks against LLMs have gained attention for being lightweight, enabling vulnerability discovery in LLMs. Existing methods bind backdoors to predefined phrases that serve as the first few output tokens, inducing the LLM’s next-token prediction to produce continuous responses. However, their effectiveness depends heavily on the number of bound phrases, and attack costs rise as this number increases. In this work, we propose JEST, which achieves jailbreak backdoor attacks by hijacking LLM representations into an acceptance domain rather than binding to a few output tokens. Specifically, we propose a representation transition-guided model editing method to inject jailbreak backdoors into LLMs. Once activated, the backdoor transitions the LLM from the rejection domain to the acceptance domain, causing it to accept and generate jailbreak behavior. To clearly distinguish between the rejection and acceptance domains within LLMs, we also design a domain modeling strategy for JEST that models these two opposing domains within the representation space. Additionally, JEST-hijacked LLMs exhibit greater vulnerability to direct prompt attacks. Experimental results show that JEST outperforms existing model editing methods, demonstrating stronger jailbreak capabilities across various LLMs and datasets. We also provide an analysis that explores the safety boundary of LLMs.
Anthology ID:
2026.findings-acl.1625
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
32463–32477
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1625/
Cite (ACL):
Shiji Yang, Min Cai, Hao Xiong, Congyao Mei, Haodong Zou, Shicheng Tan, Jie Chen, Fulan Qian, and Shu Zhao. 2026. Rejection-to-Acceptance Transition: Model Editing-Based Jailbreak Backdoor Injection Not Limited to Few Output Tokens. In Findings of the Association for Computational Linguistics: ACL 2026, pages 32463–32477, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Rejection-to-Acceptance Transition: Model Editing-Based Jailbreak Backdoor Injection Not Limited to Few Output Tokens (Yang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1625.pdf
Checklist:
 2026.findings-acl.1625.checklist.pdf