Trustworthy and Explainable Causal Representation Learning in Transformers

Yang Liu; Yinghao Zhang; Lin Liu; Jiuyong Li; Debo Cheng; Zaiwen Feng

Abstract

A prevalent approach to interpretable representation learning creates a mask that weights the significance of each input feature, then derives a masked representation by applying this mask to the input representation. However, the identifiability of these learned masked representations is often uncertain, which makes their origin ambiguous or unreliable. Furthermore, approaches that interpret Transformers through attention weights have been criticized for lacking faithfulness. To address these limitations, we propose a novel causal framework that directly learns identifiable and explainable representations from attention weights, rather than relying on importance masks. Our framework leverages identifiability theory and causal representation learning to extract explainable representations within a subspace of input representations, effectively transforming frozen representation learning methods into self-explaining systems. Experimental results on real-world datasets demonstrate that, compared to well-established state-of-the-art methods, our approach provides identifiable and more trustworthy explanations while guaranteeing faithfulness.

Anthology ID: 2026.findings-acl.1368
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 27482–27501
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1368/
Cite (ACL): Yang Liu, Yinghao Zhang, Lin Liu, Jiuyong Li, Debo Cheng, and Zaiwen Feng. 2026. Trustworthy and Explainable Causal Representation Learning in Transformers. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27482–27501, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Trustworthy and Explainable Causal Representation Learning in Transformers (Liu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1368.pdf
Checklist: 2026.findings-acl.1368.checklist.pdf
