Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou
Abstract
Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model’s parameter space, enabling precise, interpretable, and robust unlearning. SSPU’s three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that projects activations into an “irrelevant” subspace while preserving retained knowledge. Overall, we use SAE features to construct a subspace that supervises unlearning, refining the loss and adding a regularization term to guide interpretable parameter updates. In experiments on the WMDP-Cyber forget set and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge accuracy by 3.22% compared to the strongest baseline. It also improves adversarial robustness, lowering malicious accuracy under jailbreak prompts relative to baselines. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable subspace-guided optimization can achieve robust, controllable model behavior.
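As a rough illustration of the subspace machinery the abstract describes, the following is a minimal PyTorch sketch, not the paper's actual implementation: SAE decoder directions for selected "forget" features are orthonormalized with a QR decomposition, the resulting basis defines a projector onto the subspace, and a penalty discourages parameter updates outside it. All names and shapes (W_dec, forget_idx, lambda_reg, delta_W) are illustrative assumptions.

```python
import torch

def build_forget_subspace(W_dec: torch.Tensor, forget_idx: torch.Tensor) -> torch.Tensor:
    """Orthonormalize selected SAE decoder directions via QR decomposition.

    W_dec: (n_features, d_model) SAE decoder matrix (assumed layout).
    forget_idx: indices of features chosen by the selection stage.
    """
    D = W_dec[forget_idx].T                    # (d_model, k) selected directions
    Q, _ = torch.linalg.qr(D, mode="reduced")  # orthonormal basis of the subspace
    return Q                                   # (d_model, k)

def project(acts: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Project activations (batch, d_model) onto the subspace spanned by Q."""
    return (acts @ Q) @ Q.T

def subspace_regularizer(delta_W: torch.Tensor, Q: torch.Tensor,
                         lambda_reg: float = 1.0) -> torch.Tensor:
    """Penalize the component of a parameter update that leaves span(Q).

    An assumed form of the regularization term; lambda_reg is an
    illustrative hyperparameter.
    """
    off_subspace = delta_W - Q @ (Q.T @ delta_W)  # component outside span(Q)
    return lambda_reg * off_subspace.pow(2).sum()
```

Under these assumptions, the regularizer would be added to the unlearning loss so that gradient steps stay aligned with the SAE-derived subspace; the exact loss and selection criteria are specified in the paper itself.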
Anthology ID:
2025.emnlp-main.1348
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
26541–26557
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1348/
Cite (ACL):
Xu Wang, Zihao Li, Benyou Wang, Yan Hu, and Difan Zou. 2025. Model Unlearning via Sparse Autoencoder Subspace Guided Projections. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26541–26557, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Model Unlearning via Sparse Autoencoder Subspace Guided Projections (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1348.pdf
Checklist:
2025.emnlp-main.1348.checklist.pdf