Evaluating Sparse Autoencoders for Monosemantic Representation
Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, A.b. Siddique
Abstract
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through the lens of activation distributions. We introduce a fine-grained concept separability score based on the Jensen–Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and sentence-level), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
- Anthology ID: 2026.findings-eacl.313
- Volume: Findings of the Association for Computational Linguistics: EACL 2026
- Month: March
- Year: 2026
- Address: Rabat, Morocco
- Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5969–5984
- URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.313/
- Cite (ACL): Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, and A.b. Siddique. 2026. Evaluating Sparse Autoencoders for Monosemantic Representation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5969–5984, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal): Evaluating Sparse Autoencoders for Monosemantic Representation (Fereidouni et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.313.pdf
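The Jensen–Shannon-based concept separability idea from the abstract can be sketched in code. This is an illustrative reconstruction, not the paper's exact metric: the histogram binning, the pairwise averaging over concept pairs, and all function names (`js_distance`, `separability`) are assumptions made for this sketch.

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # KL divergence with a small epsilon for numerical stability
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(jsd, 0.0)))

def separability(activations_by_concept, bins=20):
    """Mean pairwise JS distance between a neuron's per-concept
    activation histograms (higher = more concept-separable)."""
    arrs = [np.asarray(a, dtype=float) for a in activations_by_concept]
    lo = min(a.min() for a in arrs)
    hi = max(a.max() for a in arrs)
    # Shared bin edges so the histograms are comparable across concepts
    hists = [np.histogram(a, bins=bins, range=(lo, hi))[0].astype(float)
             for a in arrs]
    n = len(hists)
    pairs = [js_distance(hists[i], hists[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(pairs))
```

A neuron whose activations cluster differently for each concept yields histograms far apart in JS distance and thus a high score, matching the abstract's notion of distinct concept-conditioned activation distributions.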