Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du


Abstract
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up the activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-means, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
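The denoising idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the SAE weights, dimensions, scaling factor, and the use of mean-activation differences to rank latents are all hypothetical stand-ins chosen to show the top-k selection and scaled reconstruction steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and scale factor (hypothetical; real SAEs are far larger).
d_model, d_sae, k, alpha = 8, 32, 4, 2.0

# Random stand-in SAE: linear encoder with ReLU latents, linear decoder.
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def encode(h):
    return np.maximum(h @ W_enc, 0.0)

def decode(z):
    return z @ W_dec

# Hidden representations of positive/negative samples for one concept
# (random stand-ins for real model activations).
H_pos = rng.normal(size=(16, d_model))
H_neg = rng.normal(size=(16, d_model))

# Rank latents by how strongly their mean activation differs between
# positive and negative samples; keep the top-k most discriminative.
diff = encode(H_pos).mean(axis=0) - encode(H_neg).mean(axis=0)
top_k = np.argsort(-np.abs(diff))[:k]

def denoise(h, scale=alpha):
    """Scale up the top-k discriminative latents, then reconstruct."""
    z = encode(h)
    z[..., top_k] *= scale
    return decode(z)

# A denoised concept vector via difference-in-means on reconstructions.
v = denoise(H_pos).mean(axis=0) - denoise(H_neg).mean(axis=0)
```

The resulting vector `v` would then be added to hidden states at inference time to steer generation, as with ordinary difference-in-means vectors.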
Anthology ID:
2026.findings-eacl.40
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
797–808
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.40/
Cite (ACL):
Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, and Mengnan Du. 2026. Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering. In Findings of the Association for Computational Linguistics: EACL 2026, pages 797–808, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering (Zhao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.40.pdf
Checklist:
 2026.findings-eacl.40.checklist.pdf