Trustworthy and Explainable Causal Representation Learning in Transformers
Yang Liu, Yinghao Zhang, Lin Liu, Jiuyong Li, Debo Cheng, Zaiwen Feng
Abstract
A prevalent approach to interpretable representation learning creates a mask that weights the significance of each input feature and then derives a masked representation by applying this mask to the input representation. However, the identifiability of these learned masked representations is often uncertain, which makes their origin ambiguous or unreliable. Furthermore, approaches that interpret Transformers through attention weights have been criticized for lacking faithfulness. To address these limitations, we propose a novel causal framework that directly learns identifiable and explainable representations from attention weights, rather than relying on importance masks. Our framework leverages identifiability theory and causal representation learning to extract explainable representations within a subspace of input representations, effectively transforming frozen representation learning methods into self-explaining systems. Experimental results on real-world datasets demonstrate that, compared to well-established state-of-the-art methods, our approach provides identifiable and more trustworthy explanations while guaranteeing faithfulness.
- Anthology ID:
- 2026.findings-acl.1368
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27482–27501
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1368/
- Cite (ACL):
- Yang Liu, Yinghao Zhang, Lin Liu, Jiuyong Li, Debo Cheng, and Zaiwen Feng. 2026. Trustworthy and Explainable Causal Representation Learning in Transformers. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27482–27501, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Trustworthy and Explainable Causal Representation Learning in Transformers (Liu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1368.pdf