Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety

Yiwei Wang, Muhao Chen, Nanyun Peng, Kai-Wei Chang


Abstract
Previous research on jailbreak attacks has mainly focused on optimizing the content of the adversarial snippet injected into input prompts to expose LLM security vulnerabilities. Much of this work develops increasingly complex, less readable adversarial snippets in pursuit of higher attack success rates. In contrast, our work investigates how the position of the adversarial snippet affects the effectiveness of jailbreak attacks. We find that placing a simple, readable adversarial snippet at the beginning of the model's output effectively exposes LLM safety vulnerabilities, yielding much higher attack success rates than input suffix attacks or prompt-based output jailbreaks. More precisely, we find that directly enforcing an output prefix that embeds the user's target is an effective way to expose LLMs' safety vulnerabilities.
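The sketch below is a minimal, hedged illustration of the output-prefix mechanism the abstract describes: the assistant turn is prefilled with a short, readable prefix so that decoding continues from it. The model name, the benign placeholder request, the prefix wording, and the use of Hugging Face Transformers are assumptions for illustration only; this is not the authors' exact attack implementation.

# A minimal sketch of output-prefix prefilling with Hugging Face Transformers.
# Model name, request, and prefix text are illustrative placeholders, not the
# paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

user_request = "Explain how rainbows form."       # benign placeholder request
output_prefix = "Sure, here is the explanation:"  # simple, readable output prefix

# Build the chat prompt up to the start of the assistant turn, then append the
# enforced output prefix so generation continues from that prefix.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_request}],
    tokenize=False,
    add_generation_prompt=True,
) + output_prefix

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated continuation after the enforced prefix.
continuation = tokenizer.decode(
    generated[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(output_prefix + continuation)

The key design point the paper highlights is positional: the same readable snippet is far more effective when enforced at the start of the output than when appended to the input prompt.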
Anthology ID:
2025.findings-naacl.219
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3939–3952
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.219/
Cite (ACL):
Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang. 2025. Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3939–3952, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety (Wang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.219.pdf