Evaluating Sparse Autoencoders for Monosemantic Representation
Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, A.b. Siddique
Abstract
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through the lens of activation distributions. We introduce a fine-grained concept separability score based on the Jensen–Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and sentence-level), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
- Anthology ID: 2026.findings-eacl.313
- Volume: Findings of the Association for Computational Linguistics: EACL 2026
- Month: March
- Year: 2026
- Address: Rabat, Morocco
- Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5969–5984
- URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.313/
- Cite (ACL): Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, and A.b. Siddique. 2026. Evaluating Sparse Autoencoders for Monosemantic Representation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5969–5984, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal): Evaluating Sparse Autoencoders for Monosemantic Representation (Fereidouni et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.313.pdf
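The Jensen–Shannon-based concept separability idea from the abstract can be sketched in code. This is an illustrative reconstruction, not the paper's exact metric: the histogram binning, the pairwise averaging over concept pairs, and all function names (`js_distance`, `separability`) are assumptions made for this sketch.

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # KL divergence with a small epsilon for numerical stability
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(jsd, 0.0)))

def separability(activations_by_concept, bins=20):
    """Mean pairwise JS distance between a neuron's per-concept
    activation histograms (higher = more concept-separable)."""
    arrs = [np.asarray(a, dtype=float) for a in activations_by_concept]
    lo = min(a.min() for a in arrs)
    hi = max(a.max() for a in arrs)
    # Shared bin edges so the histograms are comparable across concepts
    hists = [np.histogram(a, bins=bins, range=(lo, hi))[0].astype(float)
             for a in arrs]
    n = len(hists)
    pairs = [js_distance(hists[i], hists[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(pairs))
```

A neuron whose activations cluster differently for each concept yields histograms far apart in JS distance and thus a high score, matching the abstract's notion of distinct concept-conditioned activation distributions.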