Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne


Abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
Anthology ID:
2026.acl-long.1895
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
40849–40868
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1895/
DOI:
Bibkey:
Cite (ACL):
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, and Bill Byrne. 2026. Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 40849–40868, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models (Yang et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1895.pdf
Checklist:
 2026.acl-long.1895.checklist.pdf