Understanding Refusal in Language Models with Sparse Autoencoders
Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Ranjan Satapathy, Roy Ka-Wei Lee, Erik Cambria
Abstract
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks.- Anthology ID:
- 2025.findings-emnlp.338
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 6377–6399
- Language:
- URL:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.338/
- DOI:
- 10.18653/v1/2025.findings-emnlp.338
- Cite (ACL):
- Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Ranjan Satapathy, Roy Ka-Wei Lee, and Erik Cambria. 2025. Understanding Refusal in Language Models with Sparse Autoencoders. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6377–6399, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Understanding Refusal in Language Models with Sparse Autoencoders (Yeo et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.338.pdf