Goal-Conditioned DPO: Prioritizing Safety in Misaligned Instructions

Joo Bon Maeng, Seongmin Lee, Seokin Seo, Kee-Eung Kim


Abstract
Large language models (LLMs) undergo extensive safety training to maximize both helpfulness and harmlessness in their responses. However, various jailbreak attacks jeopardize model safety, allowing malicious actors to bypass safety guidelines. Existing defense methods primarily focus on steering the model’s output towards less harmful responses through post-processing or input perturbation. Consequently, these approaches are prone to general performance degradation and cannot defend against a wide variety of attacks. In this paper, we propose goal-conditioned direct preference optimization (GC-DPO), which is trained to prioritize the system prompt over the user prompt through goal-conditioning, and thus strikes a good balance between safety and performance. Empirically, we show that our approach significantly reduces the average Attack Success Rate (ASR) across a wide variety of jailbreak attacks. In particular, GC-DPO reduces the ASR for Vicuna-7B from 67.1% to 5.0%, a state-of-the-art result, without compromising the model’s general performance.
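
The abstract does not spell out the training objective, so the following is only a rough PyTorch sketch of how a goal-conditioned variant of the DPO loss could look: the goal-bearing system prompt is prepended to the (possibly misaligned) user prompt before scoring preference pairs. The prompt template, function names, and the beta value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' released code) of goal-conditioned DPO.
# Assumption: *_logps are the summed token log-probabilities of each response
# given the goal-conditioned prompt, under the policy and reference models.
import torch
import torch.nn.functional as F


def build_goal_conditioned_prompt(system_prompt: str, user_prompt: str) -> str:
    # Hypothetical template: the system prompt (the "goal") is placed ahead of
    # the user instruction so the model learns to prioritize it.
    return f"[GOAL] {system_prompt}\n[USER] {user_prompt}\n[ASSISTANT] "


def gc_dpo_loss(policy_chosen_logps: torch.Tensor,
                policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor,
                ref_rejected_logps: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective; the goal-conditioning enters only through how
    # the log-probabilities above are computed (responses scored against the
    # goal-conditioned prompt built by build_goal_conditioned_prompt).
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```

In this sketch the "chosen" response is the one that follows the system prompt's safety goal and the "rejected" response is the one that complies with the misaligned user instruction; how the paper actually constructs these pairs is not stated in the abstract.
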
Anthology ID: 2025.naacl-long.369
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 7196–7211
URL: https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-long.369/
Cite (ACL): Joo Bon Maeng, Seongmin Lee, Seokin Seo, and Kee-Eung Kim. 2025. Goal-Conditioned DPO: Prioritizing Safety in Misaligned Instructions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7196–7211, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Goal-Conditioned DPO: Prioritizing Safety in Misaligned Instructions (Maeng et al., NAACL 2025)
PDF: https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-long.369.pdf