RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

Peiran Wang, Xiaogeng Liu, Chaowei Xiao


Abstract
In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pre-training and fine-tuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user’s original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user’s prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
Anthology ID:
2025.findings-naacl.16
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
283–294
Language:
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.findings-naacl.16/
DOI:
Bibkey:
Cite (ACL):
Peiran Wang, Xiaogeng Liu, and Chaowei Xiao. 2025. RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 283–294, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (Wang et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.findings-naacl.16.pdf