Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Jun Zhuang; Haibo Jin; Ye Zhang; Zhengjian Kang; Wenbin Zhang; Gaby G. Dagher; Haohan Wang

doi:10.18653/v1/2025.findings-emnlp.114

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang

Abstract

Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs’ moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our “FSTR+SPIN” variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs’ safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

Anthology ID:: 2025.findings-emnlp.114
Volume:: Findings of the Association for Computational Linguistics: EMNLP 2025
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 2147–2160
Language:
URL:: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.114/
DOI:: 10.18653/v1/2025.findings-emnlp.114
Bibkey:
Cite (ACL):: Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, and Haohan Wang. 2025. Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2147–2160, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation (Zhuang et al., Findings 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.114.pdf
Checklist:: 2025.findings-emnlp.114.checklist.pdf

PDF Cite Search Checklist Fix data