Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo; Nirmalendu Prakash; Clement Neo; Ranjan Satapathy; Roy Ka-Wei Lee; Erik Cambria

doi:10.18653/v1/2025.findings-emnlp.338

Abstract

Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs, using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and lets us address key research questions, such as how upstream and downstream latents relate and how adversarial jailbreaking techniques work. We also show that refusal features improve the generalization of linear probes to out-of-distribution adversarial samples in classification tasks.

Anthology ID: 2025.findings-emnlp.338
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6377–6399
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.338/
DOI: 10.18653/v1/2025.findings-emnlp.338
Cite (ACL): Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Ranjan Satapathy, Roy Ka-Wei Lee, and Erik Cambria. 2025. Understanding Refusal in Language Models with Sparse Autoencoders. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6377–6399, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Understanding Refusal in Language Models with Sparse Autoencoders (Yeo et al., Findings 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.338.pdf
Checklist: 2025.findings-emnlp.338.checklist.pdf
