Mechanistic Interpretability Should Prioritize Feature Consistency in Sparse Autoencoders

Xiangchen Song; Aashiq Muhamed; Yujia Zheng; Lingjing Kong; Zeyu Tang; Mona Diab; Virginia Smith; Kun Zhang

Abstract

Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining reproducibility and complicating model comparison. We study run-to-run feature consistency in SAEs and argue that it should be reported as a standard evaluation axis alongside reconstruction and sparsity. We propose the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as an assignment-based metric to quantify consistency and demonstrate that high consistency is achievable (PW-MCC ≈ 0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include: (i) theoretical grounding for strong consistency in the idealized setting of TopK SAEs; (ii) synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and (iii) empirical analysis on LLM activations, where PW-MCC correlates with the similarity of automatically generated natural-language feature explanations.
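
The abstract describes PW-MCC only as an assignment-based consistency metric between learned dictionaries and does not give its formula. As a minimal sketch of one plausible reading, the Python below assumes that each SAE's dictionary is a matrix of feature directions, that similarity is absolute cosine similarity, and that features from two runs are paired by a one-to-one optimal (Hungarian) assignment before averaging. The function name `pairwise_mcc` and all of these choices are assumptions made for illustration; the paper's exact definition may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_mcc(dict_a, dict_b):
    """Hypothetical PW-MCC sketch for two dictionaries of shape (n_features, d).

    Assumptions (not taken from the abstract): absolute cosine similarity
    as the per-pair score, and a one-to-one optimal assignment.
    """
    # Normalize each feature direction to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # Absolute cosine similarity between every pair of features.
    sim = np.abs(a @ b.T)
    # One-to-one matching that maximizes the total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    # Average similarity over the matched pairs.
    return float(sim[rows, cols].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run_1 = rng.normal(size=(16, 8))
    # A second "run" that learned the same features in a different order.
    run_2 = run_1[rng.permutation(16)]
    print(pairwise_mcc(run_1, run_2))
```

Under these assumptions, two runs that recover the same feature directions in a different order score close to 1, while unrelated dictionaries score lower.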

Anthology ID: 2026.acl-long.99
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2172–2210
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.99/
Cite (ACL): Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. 2026. Mechanistic Interpretability Should Prioritize Feature Consistency in Sparse Autoencoders. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2172–2210, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Mechanistic Interpretability Should Prioritize Feature Consistency in Sparse Autoencoders (Song et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.99.pdf
Checklist: 2026.acl-long.99.checklist.pdf
