Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu


Abstract
Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the influence of each latent feature on the model’s output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model’s output, and (2) only latents with high influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
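The core idea of scoring latents by output influence rather than activation alone can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the toy dimensions, the ReLU encoder, and the linear "output head" stand in for the paper's actual LLM and SAE; the only point shown is the activation-times-gradient scoring.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only, not the paper's implementation):
# hidden state h, a ReLU SAE encoder W_enc, and a linear output head w_out.
d_model, d_sae = 8, 32
h = rng.normal(size=d_model)
W_enc = rng.normal(size=(d_sae, d_model))
w_out = rng.normal(size=d_sae)

# Input-side view: latent activations, as in conventional SAE analysis.
z = np.maximum(W_enc @ h, 0.0)

# Output-side view: gradient of the scalar output y = w_out . z with
# respect to each latent; for this linear head it is simply w_out.
grad = w_out

# GradSAE-style influence score: activation times gradient, so a latent
# that is activated but has near-zero output gradient scores near zero.
influence = z * grad

# Keep only the top-k latents by absolute influence for steering.
k = 5
top = np.argsort(-np.abs(influence))[:k]
mask = np.zeros(d_sae, dtype=bool)
mask[top] = True
```

In a real model the gradient would be obtained by backpropagating a task loss or target logit to the SAE latents; the contrast with activation-only selection is that inactive latents and activated-but-inert latents both receive zero or negligible scores.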
Anthology ID:
2025.emnlp-main.87
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1673–1682
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.87/
Cite (ACL):
Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, and Ninghao Liu. 2025. Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1673–1682, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders (Shu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.87.pdf
Checklist:
 2025.emnlp-main.87.checklist.pdf