Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed; Sabrina Sadiekh; Chirag Agarwal

Abstract

Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
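The inference-time integration described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the abstract does not specify the SAE architecture, so a hypothetical top-k SAE with random weights stands in for a pretrained one, and the function names (`matvec`, `sae_forward`) and sizes are invented. The idea shown is that the residual-stream vector `h` at a chosen layer is replaced by its sparse reconstruction, with `k` bounding the L0 sparsity of the latent code. In an autodiff framework the same operations would remain differentiable almost everywhere, so attack gradients would still flow through the bottleneck rather than being blocked.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(h, w_enc, b_enc, w_dec, b_dec, k):
    """Hypothetical top-k SAE: encode, keep the k largest latents, decode.

    Returns the sparse latent code and the reconstruction that would
    replace h in the residual stream.
    """
    pre = [a + b for a, b in zip(matvec(w_enc, h), b_enc)]
    acts = [max(0.0, a) for a in pre]  # ReLU
    # Keep only the k strongest latents and zero the rest, so L0 <= k.
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    latents = [a if i in keep else 0.0 for i, a in enumerate(acts)]
    recon = [a + b for a, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, recon


# Toy dimensions and random weights (assumed; a real setup would load a
# pretrained SAE matched to the model and layer).
rng = random.Random(0)
d_model, n_latents, k = 8, 32, 4
w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
w_dec = [[rng.gauss(0, 1) for _ in range(n_latents)] for _ in range(d_model)]
b_enc = [0.0] * n_latents
b_dec = [0.0] * d_model

h = [rng.gauss(0, 1) for _ in range(d_model)]  # residual-stream vector
latents, h_sae = sae_forward(h, w_enc, b_enc, w_dec, b_dec, k)
l0 = sum(1 for z in latents if z != 0.0)  # number of active latents
```

Lowering `k` in this sketch tightens the bottleneck, which is the knob the sparsity ablation in the abstract varies; the model weights themselves are never touched.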

Anthology ID: 2026.findings-acl.1298
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 26066–26086
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1298/
Cite (ACL): Ahson Saiyed, Sabrina Sadiekh, and Chirag Agarwal. 2026. Towards Understanding the Robustness of Sparse Autoencoders. In Findings of the Association for Computational Linguistics: ACL 2026, pages 26066–26086, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Towards Understanding the Robustness of Sparse Autoencoders (Saiyed et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1298.pdf
Checklist: 2026.findings-acl.1298.checklist.pdf
