Yushi Yang
2025
LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Harry Mayne | Ryan Othniel Kearns | Yushi Yang | Andrew M. Bean | Eoin D. Delaney | Chris Russell | Adam Mahdi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid (achieving the intended outcome) and minimal (modifying the input no more than necessary). When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
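The two criteria in this abstract lend themselves to a simple programmatic check. The sketch below is an illustrative reading of that setup, not the paper's evaluation code: `predict` is a placeholder for re-querying the model under study, and word-level similarity is used only as a rough proxy for edit minimality (the linked repository contains the actual protocol).

```python
# Hypothetical sketch of checking an SCE for validity and (a proxy for)
# minimality. `predict` stands in for re-querying the LLM being explained.
import difflib

def evaluate_sce(original: str, counterfactual: str,
                 original_label: str, predict) -> dict:
    """Validity: does the model's prediction change on the counterfactual?
    Minimality proxy: word-level similarity between original and edit."""
    new_label = predict(counterfactual)  # same model, edited input
    similarity = difflib.SequenceMatcher(
        None, original.split(), counterfactual.split()
    ).ratio()
    return {"valid": new_label != original_label, "similarity": similarity}

# Toy stand-in classifier so the sketch runs end to end; a real evaluation
# would call the LLM itself.
predict = lambda text: "positive" if "good" in text else "negative"
print(evaluate_sce("the food was good", "the food was bad", "positive", predict))
# {'valid': True, 'similarity': 0.75}
```

Under this framing, the trade-off reported above corresponds to counterfactuals that either flip the label with low similarity (valid but not minimal) or keep high similarity without flipping the label (minimal but invalid).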
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
Yushi Yang | Filip Sondej | Harry Mayne | Andrew Lee | Adam Mahdi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations—attributing its effects solely to dampened toxic neurons in the MLP layers—are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons only account for 2.5% to 24% of DPO’s effects across models. Instead, DPO induces distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups—two aligned with reducing toxicity and two promoting anti-toxicity—whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method that mimics DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
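The inference-time alternative to DPO described here can be pictured as steering MLP outputs along a toxicity direction. The snippet below is a hypothetical sketch of that idea using PyTorch forward hooks; the `model.model.layers` / `layer.mlp` layout, the `toxicity_dirs` dictionary, and the `alpha` scale are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch: shift each MLP output away from a precomputed
# per-layer toxicity direction at inference time (no weight updates).
import torch

def add_toxicity_steering_hooks(model, toxicity_dirs: dict, alpha: float = 5.0):
    """Register forward hooks on the MLP blocks of a Llama-style model.
    `toxicity_dirs` maps layer index -> unit-norm direction of hidden size."""
    handles = []
    for idx, layer in enumerate(model.model.layers):  # assumes HF Llama-style layout
        if idx not in toxicity_dirs:
            continue

        def hook(module, inputs, output, d=toxicity_dirs[idx]):
            # Returning a tensor from a forward hook replaces the module output.
            return output - alpha * d.to(device=output.device, dtype=output.dtype)

        handles.append(layer.mlp.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the original model
```

Because the shift is applied uniformly to every token's MLP output rather than to a handful of identified toxic neurons, this mirrors the distributed, tuning-free character of the intervention described in the abstract.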