How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Yushi Yang; Filip Sondej; Harry Mayne; Andrew Lee; Adam Mahdi

Abstract

Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations—attributing its effects solely to dampened toxic neurons in the MLP layers—are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons only account for 2.5% to 24% of DPO’s effects across models. Instead, DPO induces distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups—two aligned with reducing toxicity and two promoting anti-toxicity—whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method that mimics DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
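
The activation-editing idea in the abstract (shifting all MLP neurons a little along a toxicity representation, with no weight updates) can be illustrated with a minimal sketch. This is not the authors' implementation. The toxicity direction `toxic_dir`, the step size `alpha`, the helper names, and the specific rule of shifting each activation in proportion to how strongly its neuron writes along that direction are all assumptions made here for illustration.

```python
import numpy as np

def shift_activations(acts, value_vectors, toxic_dir, alpha=0.5):
    """Illustrative edit (assumed rule, not the paper's exact method).

    acts:          (n_neurons,) activations of one MLP layer for one token
    value_vectors: (n_neurons, d_model) output vector written by each neuron
    toxic_dir:     (d_model,) hypothetical toxicity direction, e.g. a probe weight
    alpha:         hypothetical step size controlling the strength of the edit
    """
    unit = toxic_dir / np.linalg.norm(toxic_dir)
    # How strongly each neuron writes along the toxicity direction.
    proj = value_vectors @ unit
    # Small shift on every neuron: lower those writing toward toxicity,
    # raise those writing away from it.
    return acts - alpha * proj

def toxicity_projection(acts, value_vectors, toxic_dir):
    """Projection of the layer's MLP output onto the unit toxicity direction."""
    unit = toxic_dir / np.linalg.norm(toxic_dir)
    return float((acts @ value_vectors) @ unit)

# Toy usage with random data standing in for a real model's layer.
rng = np.random.default_rng(0)
acts = rng.normal(size=16)
V = rng.normal(size=(16, 8))
w = rng.normal(size=8)
edited = shift_activations(acts, V, w)
print(toxicity_projection(acts, V, w), toxicity_projection(edited, V, w))
```

Under this assumed rule the projection onto the toxicity direction drops by `alpha` times the sum of squared per-neuron projections, so the reduction comes from many small shifts rather than from a few strongly toxic neurons.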

Anthology ID: 2025.emnlp-main.1501
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 29512–29531
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1501/
Cite (ACL): Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, and Adam Mahdi. 2025. How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29512–29531, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis (Yang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1501.pdf
Checklist: 2025.emnlp-main.1501.checklist.pdf
