Filip Sondej




2025

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
Yushi Yang | Filip Sondej | Harry Mayne | Andrew Lee | Adam Mahdi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations—attributing its effects solely to dampened toxic neurons in the MLP layers—are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons only account for 2.5% to 24% of DPO’s effects across models. Instead, DPO induces distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups—two aligned with reducing toxicity and two promoting anti-toxicity—whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method that mimics DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
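The activation-editing idea summarised in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it uses a random placeholder for the toxicity direction (which in practice would be estimated from data), applies a projection-subtraction edit to every MLP output of GPT-2-Medium via forward hooks, and picks an arbitrary edit strength `alpha`; the paper's exact direction estimation and per-layer shifts may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"  # one of the models analysed in the paper
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

hidden_size = model.config.hidden_size
# Placeholder toxicity direction; in practice this would be estimated
# (e.g., by probing activations on toxic vs. non-toxic text).
toxicity_dir = torch.randn(hidden_size)
toxicity_dir = toxicity_dir / toxicity_dir.norm()
alpha = 5.0  # hypothetical edit strength

def edit_hook(module, inputs, output):
    # Shift each MLP activation away from the toxicity direction:
    # h <- h - alpha * (h . d) d
    proj = (output @ toxicity_dir).unsqueeze(-1) * toxicity_dir
    return output - alpha * proj

# Apply the distributed shift to every MLP block; no weight updates are made.
handles = [block.mlp.register_forward_hook(edit_hook) for block in model.transformer.h]

prompt = "You are such a"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20)
print(tokenizer.decode(out[0]))

# Remove the hooks to restore the unedited model.
for h in handles:
    h.remove()

Because the edit is applied at inference time through hooks, the model's weights are untouched, which is what the abstract means by an efficient, tuning-free alternative to fine-tuning.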