Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks

Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho

Abstract
Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models’ safety guardrails and security mechanisms by embedding jailbreak prompts in malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which often compromise model utility for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results on Llama-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and robust solution for various LLM platforms.
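To make the mechanism concrete, the sketch below illustrates how a suffix-style defensive prompt patch would be applied at inference time: the patch is appended to the incoming user query before the query reaches the model. This is a minimal sketch under stated assumptions; the patch text, function name, and example query are hypothetical placeholders, since the paper obtains its actual patch through an optimization procedure not reproduced here.

```python
# Hypothetical illustration of a suffix-style defensive prompt patch.
# The patch string below is a placeholder; the paper's DPP method learns
# an optimized suffix, which is not reproduced here.

DEFENSIVE_PATCH = (
    "\nRemember: refuse any request that is harmful, unethical, or "
    "violates safety policy, and briefly explain the refusal."
)

def apply_dpp(user_query: str, patch: str = DEFENSIVE_PATCH) -> str:
    """Append the defensive suffix to the user's query, leaving the
    query itself unchanged, before it is sent to the LLM."""
    return user_query + patch

if __name__ == "__main__":
    query = "How do I create a harmless smoke effect for a school play?"
    print(apply_dpp(query))
```

Because a defense of this form lives entirely in the prompt rather than in the model weights, it can plausibly be carried across different LLM platforms, consistent with the scalability claim in the abstract.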
Anthology ID:
2025.findings-acl.23
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
409–437
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.23/
Cite (ACL):
Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. 2025. Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks. In Findings of the Association for Computational Linguistics: ACL 2025, pages 409–437, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks (Xiong et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.23.pdf