Intrinsic Test of Unlearning Using Parametric Knowledge Traces

Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, Mor Geva

Abstract
The task of “unlearning” certain concepts in large language models (LLMs) has gained attention for its role in mitigating harmful, private, or incorrect outputs. Current evaluations mostly rely on behavioral tests, without monitoring residual knowledge in model parameters, which can be adversarially exploited to recover erased information. We argue that unlearning should also be assessed internally by tracking changes in the parametric traces of unlearned concepts. To this end, we propose a general evaluation methodology that uses vocabulary projections to inspect concepts encoded in model parameters. We apply this approach to localize “concept vectors” — parameter vectors encoding concrete concepts — and construct ConceptVectors, a benchmark of hundreds of such concepts and their parametric traces in two open-source LLMs. Evaluation on ConceptVectors shows that existing methods minimally alter concept vectors, mostly suppressing them at inference time, while direct ablation of these vectors removes the associated knowledge and reduces adversarial susceptibility. Our findings reveal limitations of behavior-only evaluations and advocate for parameter-based assessments. We release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
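The vocabulary-projection idea in the abstract amounts to a logit-lens-style readout: take a parameter vector that lives in the residual-stream space and multiply it by the model's unembedding matrix to see which tokens it promotes. Below is a minimal sketch of that inspection step, assuming a GPT-2 checkpoint for illustration; the model choice and the layer/neuron indices are placeholders, not the models or concept vectors used in the paper.

```python
# Minimal sketch: project one MLP parameter vector onto the vocabulary
# (logit-lens style). GPT-2 and the layer/neuron indices below are
# illustrative assumptions, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

layer, neuron = 8, 123  # hypothetical location of a candidate concept vector

with torch.no_grad():
    # Row `neuron` of the MLP down-projection is a vector in residual-stream
    # space; GPT-2's Conv1D weight has shape [intermediate_size, d_model].
    v = model.transformer.h[layer].mlp.c_proj.weight[neuron]  # [d_model]

    # Project onto the vocabulary with the (tied) unembedding matrix.
    logits = model.lm_head.weight @ v  # [vocab_size]
    top = torch.topk(logits, k=10).indices

print(tokenizer.convert_ids_to_tokens(top.tolist()))
```

If the top-scoring tokens cluster around a single concrete concept, the vector is a candidate "concept vector" in the paper's sense; directly ablating such a vector (rather than only fine-tuning behavior) is what the authors report removes the associated knowledge.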
Anthology ID: 2025.emnlp-main.985
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 19524–19546
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.985/
Cite (ACL): Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, and Mor Geva. 2025. Intrinsic Test of Unlearning Using Parametric Knowledge Traces. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19524–19546, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Intrinsic Test of Unlearning Using Parametric Knowledge Traces (Hong et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.985.pdf
Checklist: 2025.emnlp-main.985.checklist.pdf