SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding; Wen Sun; Dailin Li; Wei Zou; Jiaming Wang; Jiajun Chen; Shujian Huang (书剑 黄)

Abstract

Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model’s inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model’s own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs’ discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model’s generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.

Anthology ID: 2025.emnlp-main.253
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5023–5037
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.253/
Cite (ACL): Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, and Shujian Huang. 2025. SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5023–5037, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models (Ding et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.253.pdf
Checklist: 2025.emnlp-main.253.checklist.pdf
