@inproceedings{zhang-etal-2025-intention,
title = "Intention Analysis Makes {LLM}s A Good Jailbreak Defender",
author = "Zhang, Yuqi and
Ding, Liang and
Zhang, Lefei and
Tao, Dacheng",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Di Eugenio, Barbara and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.199/",
pages = "2947--2968",
abstract = "Aligning large language models (LLMs) with human values, particularly when facing complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook this intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis (IA). IA works by triggering LLMs' inherent self-correct and improve ability through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing final policy-aligned responses based on the first round conversation. Notably,IA is an inference-only method, thus could enhance LLM safety without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across a wide range of LLMs show that IA could consistently and significantly reduce the harmfulness in responses (averagely -48.2{\%} attack success rate). Encouragingly, with our IA, Vicuna-7B even outperforms GPT-3.5 regarding attack success rate. We empirically demonstrate that, to some extent, IA is robust to errors in generated intentions. Further analyses reveal the underlying principle of IA: suppressing LLM{'}s tendency to follow jailbreak prompts, thereby enhancing safety."
}
Markdown (Informal)
[Intention Analysis Makes LLMs A Good Jailbreak Defender](https://aclanthology.org/2025.coling-main.199/) (Zhang et al., COLING 2025)
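The two-stage process described in the abstract maps naturally onto a plain chat pipeline: one inference call to extract the query's essential intention, and a second call that answers conditioned on that first-round conversation. Below is a minimal sketch of this idea, assuming a generic `chat(messages) -> str` completion function as a placeholder; the prompt wording is illustrative and does not reproduce the paper's exact prompts.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def intention_analysis_defense(
    user_input: str,
    chat: Callable[[List[Message]], str],
) -> str:
    """Sketch of two-stage Intention Analysis (IA): analyze intent, then answer."""
    # Stage 1: ask the model to identify the essential intention of the
    # user input, without answering it yet.
    stage1: List[Message] = [
        {
            "role": "user",
            "content": (
                "Identify the essential intention behind the following user "
                f"query. Do not answer it yet.\n\nQuery: {user_input}"
            ),
        }
    ]
    intention = chat(stage1)

    # Stage 2: respond to the original query, conditioned on the first-round
    # conversation, staying aligned with safety policy.
    stage2 = stage1 + [
        {"role": "assistant", "content": intention},
        {
            "role": "user",
            "content": (
                "Knowing its essential intention, now respond to the query "
                "above. If the intention is harmful, refuse and explain why; "
                "otherwise, answer helpfully."
            ),
        },
    ]
    return chat(stage2)
```

Because both stages are ordinary inference calls, a defense of this shape requires no fine-tuning and adds one extra model call per query, consistent with the abstract's description of IA as inference-only.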