Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
Abstract
Sparse Autoencoders (SAEs) show promise for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and bag-of-words (BoW) baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize zero-shot to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications.
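This page carries no code; the following is a minimal, hypothetical sketch of the pipeline the abstract describes (pool per-token SAE activations into one vector per example, optionally binarize them against a threshold, then fit a linear probe). The SAE width, the threshold tau, the pooling choice, and the random stand-in activations are illustrative placeholders, not the authors' actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-in for per-token SAE activations: acts[i] has shape
# (num_tokens_i, sae_width). Real SAE widths (e.g. 16K-65K features)
# are larger; 2048 keeps this sketch light. ReLU-style clipping mimics
# the nonnegative, sparse activations an SAE encoder would produce.
rng = np.random.default_rng(0)
acts = [np.maximum(rng.normal(size=(rng.integers(5, 40), 2048)), 0.0)
        for _ in range(100)]
labels = rng.integers(0, 2, size=100)

def featurize(a, pooling="max", binarize=True, tau=0.0):
    """Pool token-level SAE activations into one vector; optionally binarize."""
    x = a.max(axis=0) if pooling == "max" else a.mean(axis=0)
    # Binarization keeps only whether a feature fired above threshold tau,
    # a cheap thresholding step standing in for explicit feature selection.
    return (x > tau).astype(np.float32) if binarize else x.astype(np.float32)

X = np.stack([featurize(a) for a in acts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("macro F1 (train split, toy data):",
      f1_score(labels, clf.predict(X), average="macro"))
```

Swapping `pooling="max"` for `"mean"` or toggling `binarize` reproduces, in miniature, the design axes the paper evaluates; on real data these features would come from a pretrained SAE over frozen LLM hidden states rather than random draws.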
- Anthology ID: 2025.emnlp-main.1521
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 29927–29951
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1521/
- Cite (ACL): Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, and Danielle Bitterman. 2025. Sparse Autoencoder Features for Classifications and Transferability. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29927–29951, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Sparse Autoencoder Features for Classifications and Transferability (Gallifant et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1521.pdf