Zaiwen Feng
2026
Trustworthy and Explainable Causal Representation Learning in Transformers
Yang Liu | Yinghao Zhang | Lin Liu | Jiuyong Li | Debo Cheng | Zaiwen Feng
Findings of the Association for Computational Linguistics: ACL 2026
A prevalent approach to interpretable representation learning creates a mask that weights the significance of each input feature and then derives a masked representation by applying this mask to the input representation. However, the identifiability of these learned masked representations is often uncertain, which makes their origin ambiguous or unreliable. Furthermore, approaches that interpret Transformers through attention weights have been criticized for lacking faithfulness. To address these limitations, we propose a novel causal framework that directly learns identifiable and explainable representations from attention weights, rather than relying on importance masks. Our framework leverages identifiability theory and causal representation learning to extract explainable representations within a subspace of input representations, effectively transforming frozen representation learning methods into self-explaining systems. Experimental results on real-world datasets demonstrate that, compared to well-established state-of-the-art methods, our approach provides identifiable and more trustworthy explanations while guaranteeing faithfulness.