Detoxifying Large Language Models via the Diversity of Toxic Samples
Ying Zhao, Yuanzhao Guo, Xuemeng Weng, Yuan Tian, Wei Wang, Yi Chang
Abstract
Eliminating toxicity from Large Language Models (LLMs) is crucial for ensuring user safety. However, current methods have limitations in the analysis and utilization of toxic samples, failing to fully harness their potential. Through comparative analysis of toxic and safe samples, we discover that toxic samples exhibit diversity and, within this diversity, there lies specificity. These findings suggest that leveraging these characteristics of toxic samples could enhance the performance of algorithms in detoxifying LLMs. To this end, we propose a novel diverse detoxification framework, DivDetox, which comprises two innovative components: a Multi-Category-Induced Personalized Sample Generation (MPSG) strategy and a Scaled Contrastive DPO (SC-DPO) approach. The former is designed to elicit a variety of personalized toxic responses from the LLM, while the latter is constructed to precisely and fully utilize these toxic responses. Experiments on benchmark datasets across different model scales and different detoxification tasks verify the effectiveness of our architecture.
- Anthology ID:
- 2025.emnlp-main.298
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5869–5882
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.298/
- Cite (ACL):
- Ying Zhao, Yuanzhao Guo, Xuemeng Weng, Yuan Tian, Wei Wang, and Yi Chang. 2025. Detoxifying Large Language Models via the Diversity of Toxic Samples. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5869–5882, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Detoxifying Large Language Models via the Diversity of Toxic Samples (Zhao et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.298.pdf