Aligning What LLMs Do and Say: Towards Self-Consistent Explanations

Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser


Abstract
Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their answers. Yet the features driving an answer often differ from those emphasized in its explanation, meaning post-hoc rationales can misrepresent what actually shaped the model’s output. We quantify this gap by comparing the feature-importance distributions of answers and their explanations. Prior analyses reveal such discrepancies, but large-scale study has been limited by the high computational cost of attribution methods. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark linking model decisions with diverse explanations and attribution vectors across datasets, methods, and model families. Using PSCB, we find that Spearman rank correlation provides a more reliable signal of alignment than cosine similarity. Building on this insight, we apply Direct Preference Optimization (DPO) to attribution-based preference data, improving alignment without degrading task accuracy, and show that standard supervised fine-tuning on the same data fails to achieve comparable gains. These improvements generalize robustly across domains, paving the way toward scalable and faithful alignment between LLM decisions and their natural language explanations.
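To make the comparison in the abstract concrete, the following is a minimal sketch of scoring the agreement between two attribution vectors with both cosine similarity and Spearman rank correlation. The function names and the toy attribution scores are hypothetical and are not taken from the paper or from PSCB. The example only illustrates that the two metrics can disagree sharply on the same pair of vectors; it is not a reproduction of the authors' evaluation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(x, y):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical attribution scores over the same five input features:
# one vector for the model's answer, one for its explanation.
answer_attr = [0.70, 0.10, 0.08, 0.07, 0.05]
explanation_attr = [0.60, 0.05, 0.07, 0.08, 0.20]

print("cosine:  ", cosine(answer_attr, explanation_attr))
print("spearman:", spearman(answer_attr, explanation_attr))
```

In this toy case a single dominant feature keeps the cosine similarity high even though the remaining features are ordered quite differently, while the rank correlation is sensitive to that reordering. This is one possible intuition for why the two metrics can diverge, offered as an assumption rather than as the paper's stated explanation.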
Anthology ID:
2026.findings-acl.49
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
987–1003
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.49/
Cite (ACL):
Sahar Admoni, Ofra Amir, Assaf Hallak, and Yftah Ziser. 2026. Aligning What LLMs Do and Say: Towards Self-Consistent Explanations. In Findings of the Association for Computational Linguistics: ACL 2026, pages 987–1003, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Aligning What LLMs Do and Say: Towards Self-Consistent Explanations (Admoni et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.49.pdf
Checklist:
2026.findings-acl.49.checklist.pdf