Detoxifying Large Language Models via the Diversity of Toxic Samples

Ying Zhao, Yuanzhao Guo, Xuemeng Weng, Yuan Tian, Wei Wang, Yi Chang


Abstract
Eliminating toxicity from Large Language Models (LLMs) is crucial for ensuring user safety. However, current methods fall short in how they analyze and exploit toxic samples, failing to fully harness their potential. Through a comparative analysis of toxic and safe samples, we discover that toxic samples exhibit diversity and, within this diversity, specificity. These findings suggest that leveraging such characteristics of toxic samples could enhance the performance of algorithms for detoxifying LLMs. To this end, we propose a novel diverse detoxification framework, DivDetox, which comprises two innovative components: a Multi-Category-Induced Personalized Sample Generation (MPSG) strategy and a Scaled Contrastive DPO (SC-DPO) approach. The former is designed to elicit a variety of personalized toxic responses from the LLM, while the latter is constructed to precisely and fully utilize these toxic responses. Experiments on benchmark datasets across different model scales and detoxification tasks verify the effectiveness of our architecture.
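The abstract does not spell out the SC-DPO objective, so the following is only a minimal sketch of the general idea it hints at: a DPO-style loss that contrasts one safe (chosen) response against several diverse toxic (rejected) responses, with a per-sample scaling weight on each toxic pair. The function name, tensor shapes, and the toxic_weights term are illustrative assumptions, not the authors' actual formulation.

import torch
import torch.nn.functional as F

def scaled_contrastive_dpo_loss(
    policy_safe_logps,    # log pi_theta(y_safe | x), shape (B,)
    policy_toxic_logps,   # log pi_theta(y_toxic | x), shape (B, K) for K diverse toxic responses
    ref_safe_logps,       # log pi_ref(y_safe | x), shape (B,)
    ref_toxic_logps,      # log pi_ref(y_toxic | x), shape (B, K)
    toxic_weights,        # hypothetical per-pair scaling weights, shape (B, K)
    beta=0.1,
):
    """Sketch: DPO-style loss over one safe response vs. several scaled toxic responses."""
    # Implicit reward margins relative to the reference model.
    safe_margin = beta * (policy_safe_logps - ref_safe_logps)          # (B,)
    toxic_margin = beta * (policy_toxic_logps - ref_toxic_logps)       # (B, K)

    # Standard DPO term for every (safe, toxic) pair, scaled per toxic sample.
    pairwise = -F.logsigmoid(safe_margin.unsqueeze(1) - toxic_margin)  # (B, K)
    return (toxic_weights * pairwise).mean()

Weighting each pair is one natural way to make the contrast sensitive to how toxic or how representative each generated response is; the paper's actual SC-DPO design may differ.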
Anthology ID:
2025.emnlp-main.298
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5869–5882
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.298/
Cite (ACL):
Ying Zhao, Yuanzhao Guo, Xuemeng Weng, Yuan Tian, Wei Wang, and Yi Chang. 2025. Detoxifying Large Language Models via the Diversity of Toxic Samples. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5869–5882, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Detoxifying Large Language Models via the Diversity of Toxic Samples (Zhao et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.298.pdf
Checklist:
 2025.emnlp-main.298.checklist.pdf