A Mechanistic Perspective and Difficulty Metric for Unlearning

Jiali Cheng, Ziheng Chen, Chirag Agarwal, Hadi Amiri


Abstract
Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits, structured interaction pathways that govern how predictions are formed. We propose Circuit-guided Unlearning Difficulty, a pre-unlearning metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that the metric reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of unlearning difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate parts of the original model, whereas hard-to-unlearn samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, the metric takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty, and motivates the development of unlearning methods grounded in model mechanisms.
Anthology ID:
2026.findings-acl.532
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10950–10964
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.532/
Cite (ACL):
Jiali Cheng, Ziheng Chen, Chirag Agarwal, and Hadi Amiri. 2026. A Mechanistic Perspective and Difficulty Metric for Unlearning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10950–10964, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
A Mechanistic Perspective and Difficulty Metric for Unlearning (Cheng et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.532.pdf
Checklist:
 2026.findings-acl.532.checklist.pdf