Dictionary Guided Sparse Logit Editing for Reliable Jailbreak Attacks
Shuaibiao Han, Ruiyang Ni, Zhiyu Yi, Changlong Li, Perley Xu, Wenjie Ruan
Abstract
Although Large Language Models undergo rigorous safety alignment, they remain vulnerable to adversarial attacks. Existing methods, particularly gradient-based prompt optimization, suffer from high computational costs and produce uninterpretable, high-perplexity inputs. While recent logit-space attacks improve efficiency, they often rely on cumbersome auxiliary models or complex pipelines. In this work, we propose Sparse Index-Based Intervention (SIBI), a white-box, inference-time jailbreak that bypasses guardrails via lightweight, sparse logit editing. SIBI operates without gradients or auxiliary models, modifying pre-softmax logits using a compact, tokenizer-aligned dictionary of penalty and reward tokens. By incorporating temperature-consistent scaling and a mixed-norm trust region, the method ensures attack effectiveness while preserving generation fluency. On standard benchmarks, SIBI achieves high attack success rates while reducing both computational and space overhead compared to optimization baselines.
- Anthology ID:
- 2026.findings-acl.2137
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 43089–43107
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.2137/
- Cite (ACL):
- Shuaibiao Han, Ruiyang Ni, Zhiyu Yi, Changlong Li, Perley Xu, and Wenjie Ruan. 2026. Dictionary Guided Sparse Logit Editing for Reliable Jailbreak Attacks. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43089–43107, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Dictionary Guided Sparse Logit Editing for Reliable Jailbreak Attacks (Han et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.2137.pdf