Andrew M. Bean
2025
LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Harry Mayne | Ryan Othniel Kearns | Yushi Yang | Andrew M. Bean | Eoin D. Delaney | Chris Russell | Adam Mahdi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
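To illustrate the validity and minimality checks described in the abstract, the sketch below shows one way such an evaluation could be wired up. It is not the paper's pipeline (that is in the linked repository); `query_model` is a hypothetical LLM-call helper, and token-level similarity via `difflib` is only a rough proxy for edit minimality.

```python
# Illustrative sketch only: evaluating a self-generated counterfactual explanation (SCE)
# for validity (does the prediction flip to the target label?) and minimality
# (how small is the edit?). The paper's actual code: https://github.com/HarryMayne/SCEs
import difflib

def query_model(prompt: str) -> str:
    """Hypothetical helper wrapping any chat-completion API."""
    raise NotImplementedError

def classify(text: str, labels: list[str]) -> str:
    """Ask the model to assign one of `labels` to `text`."""
    answer = query_model(
        f"Classify the following text as one of {labels}.\n\nText: {text}\nLabel:"
    )
    return answer.strip()

def generate_sce(text: str, predicted: str, target: str) -> str:
    """Ask the model to minimally edit `text` so that it would predict `target`."""
    return query_model(
        f"You classified this text as '{predicted}':\n{text}\n\n"
        f"Edit it as little as possible so that you would classify it as '{target}'. "
        "Return only the edited text."
    ).strip()

def evaluate_sce(original: str, counterfactual: str, target: str, labels: list[str]):
    # Validity: the model's prediction on the edited input reaches the target label.
    valid = classify(counterfactual, labels) == target
    # Minimality proxy: token-level similarity between original and counterfactual
    # (higher similarity means a smaller edit).
    similarity = difflib.SequenceMatcher(
        None, original.split(), counterfactual.split()
    ).ratio()
    return valid, similarity
```

The validity-minimality trade-off reported in the paper would show up here as SCEs that either pass the `valid` check with low `similarity` (large edits) or fail it despite high `similarity` (edits too small to flip the prediction).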
2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
Hannah Rose Kirk | Andrew M. Bean | Bertie Vidgen | Paul Röttger | Scott A. Hale
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. In this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the ACL and arXiv repositories. First, we summarise the past, pre-LLM trends for integrating human feedback into language models. Second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. Finally, we encourage a better future of feedback learning in LLMs by raising five unresolved conceptual and practical challenges.