Zahra Dehghanighobadi


2025

Can LLMs Explain Themselves Counterfactually?
Zahra Dehghanighobadi | Asja Fischer | Muhammad Bilal Zafar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Explanations are an important tool for gaining insights into model behavior, calibrating user trust, and ensuring compliance. The past few years have seen a flurry of methods for generating explanations, many of which involve computing model gradients or solving specially designed optimization problems. Owing to the remarkable reasoning abilities of LLMs, *self-explanation*, i.e., prompting the model to explain its own outputs, has recently emerged as a new paradigm. We study a specific type of self-explanation, *self-generated counterfactual explanations* (SCEs). We test LLMs’ ability to generate SCEs across model families, sizes, temperatures, and datasets. We find that LLMs sometimes struggle to generate SCEs, and when they do, their predictions often do not agree with their own counterfactual reasoning.
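
The abstract describes an evaluation in two steps: prompt the model for a counterfactual rewrite of its input (the SCE), then re-classify that rewrite and check whether the model's new prediction actually differs from the original one. The sketch below illustrates this idea only; the prompt templates, the `query_model` callable, and the consistency criterion are hypothetical stand-ins, not the paper's actual protocol.

```python
from typing import Callable


def sce_consistency_check(
    text: str,
    label_set: list[str],
    query_model: Callable[[str], str],
) -> dict:
    """Illustrative sketch of an SCE consistency check.

    `query_model` is any function that sends a prompt to an LLM and returns
    its text response (e.g., a thin wrapper around an API client).
    """
    # 1. Ask the model to classify the original input.
    original_label = query_model(
        f"Classify the following text as one of {label_set}.\nText: {text}\nLabel:"
    ).strip()

    # 2. Ask the model for a self-generated counterfactual explanation (SCE):
    #    a minimally edited input that the model believes would flip its label.
    counterfactual = query_model(
        f"You labeled the text below as '{original_label}'. Rewrite it with "
        f"minimal edits so that you would assign a different label.\n"
        f"Text: {text}\nRewritten text:"
    ).strip()

    # 3. Re-classify the counterfactual and check whether the prediction
    #    actually changed, i.e., whether the model's prediction agrees with
    #    its own counterfactual reasoning.
    new_label = query_model(
        f"Classify the following text as one of {label_set}.\n"
        f"Text: {counterfactual}\nLabel:"
    ).strip()

    return {
        "original_label": original_label,
        "counterfactual": counterfactual,
        "new_label": new_label,
        "consistent": new_label != original_label,
    }
```

Any LLM client can be plugged in as `query_model`; aggregating the `consistent` flag over a dataset gives the kind of agreement rate the abstract refers to when it says the models' predictions often do not match their own counterfactual reasoning.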