Can LLMs Explain Themselves Counterfactually?

Zahra Dehghanighobadi, Asja Fischer, Muhammad Bilal Zafar


Abstract
Explanations are an important tool for gaining insights into model behavior, calibrating user trust, and ensuring compliance. The past few years have seen a flurry of methods for generating explanations, many of which involve computing model gradients or solving specially designed optimization problems. Owing to the remarkable reasoning abilities of LLMs, *self-explanation*, i.e., prompting the model to explain its own outputs, has recently emerged as a new paradigm. We study a specific type of self-explanation, *self-generated counterfactual explanations* (SCEs). We test LLMs’ ability to generate SCEs across model families, sizes, temperatures, and datasets. We find that LLMs sometimes struggle to generate SCEs, and when they do, their predictions often do not agree with their own counterfactual reasoning.
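
The evaluation implied by the abstract can be sketched as a simple loop: ask the model for a prediction, ask it to minimally edit the input so that its own prediction would change (the SCE), then re-query the model on the edited input and check whether the prediction actually flips. The sketch below is illustrative only and assumes a hypothetical `query_llm` helper and placeholder prompts; it is not the paper's exact protocol.

```python
# Illustrative sketch of an SCE agreement check; not the paper's exact prompts or protocol.
# `query_llm` is a hypothetical placeholder for whatever chat/completion API is used.

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("plug in your model client here")


def sce_check(text: str, labels: list[str]) -> bool:
    """Return True if the model's prediction flips on its own counterfactual edit."""
    # 1. Ask the model for its prediction on the original input.
    pred = query_llm(f"Classify the following text as one of {labels}:\n{text}")

    # 2. Ask the model to rewrite the input so that its own prediction would change.
    target = next(label for label in labels if label != pred)
    sce = query_llm(
        f"You predicted '{pred}' for the text below. "
        f"Minimally edit it so that you would instead predict '{target}'.\n{text}"
    )

    # 3. Re-query the model on the edited text and see whether the label actually flips.
    new_pred = query_llm(f"Classify the following text as one of {labels}:\n{sce}")
    return new_pred == target
```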
Anthology ID:
2025.emnlp-main.396
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7798–7826
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.396/
Cite (ACL):
Zahra Dehghanighobadi, Asja Fischer, and Muhammad Bilal Zafar. 2025. Can LLMs Explain Themselves Counterfactually?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7798–7826, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Can LLMs Explain Themselves Counterfactually? (Dehghanighobadi et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.396.pdf
Checklist:
2025.emnlp-main.396.checklist.pdf