Steering Away from Refusal: A Black-box Jailbreak Method Based on First-Token Distribution
Shuangjie Fu, Du Su, Xin Chen, Fei Sun, Huawei Shen, Xueqi Cheng
Abstract
Investigating black-box jailbreak attacks is crucial for revealing the actual security risks faced by deployed Large Language Models (LLMs). The primary challenge in black-box jailbreak attacks is the absence of direct optimization signals, such as gradients, to guide the refinement of adversarial prompts. Current mainstream methods such as PAIR and TAP attempt to leverage the model's textual output as feedback, but they face a critical limitation: when a model consistently generates static refusal responses, the attacker is deprived of any actionable signal for distinguishing better prompts from worse ones. To overcome this bottleneck, and to assess whether exposing partial logprob information poses a potential risk, we investigate the LLM output distribution. Our empirical analysis reveals that refusal responses exhibit a highly consistent distributional pattern at the first generated token, suggesting that deviation from this standard pattern can serve as a quantifiable indicator of whether an LLM will generate a refusal response. Based on this insight, we propose Distribution Jailbreak (DJ), an attack method that selects effective jailbreak templates and then iteratively optimizes adversarial suffixes by maximizing the KL divergence from the standard refusal distribution. Extensive experiments demonstrate that DJ achieves a state-of-the-art Attack Success Rate (ASR). Notably, DJ achieves over 90% ASR on all tested open-source models and over 94% ASR on GPT-4.1. Our code is publicly available at https://github.com/Zed630/DistributionJailbreak.
- Anthology ID:
- 2026.findings-acl.1294
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 25969–25979
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1294/
- Cite (ACL):
- Shuangjie Fu, Du Su, Xin Chen, Fei Sun, Huawei Shen, and Xueqi Cheng. 2026. Steering Away from Refusal: A Black-box Jailbreak Method Based on First-Token Distribution. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25969–25979, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Steering Away from Refusal: A Black-box Jailbreak Method Based on First-Token Distribution (Fu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1294.pdf