Xuemeng Weng


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2025

pdf bib
Detoxifying Large Language Models via the Diversity of Toxic Samples
Ying Zhao | Yuanzhao Guo | Xuemeng Weng | Yuan Tian | Wei Wang | Yi Chang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Eliminating toxicity from Large Language Models (LLMs) is crucial for ensuring user safety. However, current methods have limitations in the analysis and utilization of toxic samples, failing to fully harness their potential. Through comparative analysis of toxic and safe samples, we discover that toxic samples exhibit diversity and, within this diversity, there lies specificity. These findings suggest that leveraging these characteristics of toxic samples could enhance the performance of algorithms in detoxifying LLMs. To this end, we propose a novel diverse detoxification framework, DivDetox, which comprises two innovative components: a Multi-Category-Induced Personalized Sample Generation (MPSG) strategy and a Scaled Contrastive DPO (SC-DPO) approach. The former is designed to elicit a variety of personalized toxic responses from the LLM, while the latter is constructed to precisely and fully utilize these toxic responses. Experiments on benchmark datasets across different model scales and different detoxification tasks verify the effectiveness of our architecture.