Single-Layer Activation Edits Easily Corrupt Factual Recall but Rarely Repair It

Zacharie Bugaud


Abstract
Single-layer activation edits easily corrupt a language model’s correct factual answers but rarely repair its errors. On a curated factual-recall benchmark, corruption flips 70–100% of correct answers across three models, while twelve blind methods (no access to the correct answer) fix at most 6% within every evaluation pool. Per-instance gradient optimization ostensibly fixes 39%, but norm-constrained analysis reveals a magnitude artifact: at oracle-matched norms the fix rate drops to random, directions are nearly orthogonal to oracle directions (cos = -0.04), and collateral damage makes the net effect negative. An oracle ablation controlling for budget, target identity, and directional noise points to a direction-selection bottleneck: repair requires a precise, per-question direction that blind methods cannot locate. Target-informed methods partially succeed but none generalizes to unseen distributions.
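The norm-matched comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's code: the vectors are random placeholders, and the names `oracle_edit`, `optimized_edit`, `rescale_to_norm`, and `hidden_size` are hypothetical. It shows only the two generic operations involved, rescaling one edit vector to another's norm and measuring the cosine between their directions.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rescale_to_norm(edit, target_norm):
    """Keep the direction of `edit` but set its L2 norm to `target_norm`."""
    norm = np.linalg.norm(edit)
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return edit * (target_norm / norm)


# Placeholder data (assumption): random vectors standing in for a
# single-layer activation edit found by an oracle and one found by
# per-instance gradient optimization, the latter with a much larger norm.
rng = np.random.default_rng(0)
hidden_size = 64
oracle_edit = rng.normal(size=hidden_size)
optimized_edit = 10.0 * rng.normal(size=hidden_size)

# Match the optimized edit to the oracle's norm so that any remaining
# difference in effect comes from direction, not magnitude.
matched_edit = rescale_to_norm(optimized_edit, np.linalg.norm(oracle_edit))

print("norm before:", np.linalg.norm(optimized_edit))
print("norm after: ", np.linalg.norm(matched_edit))
print("cosine to oracle direction:", cosine(matched_edit, oracle_edit))
```

Rescaling leaves the cosine unchanged, so a near-zero cosine such as the reported -0.04 describes the directions themselves, independent of how large either edit is.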
Anthology ID:
2026.trustnlp-main.38
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
515–527
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.38/
Cite (ACL):
Zacharie Bugaud. 2026. Single-Layer Activation Edits Easily Corrupt Factual Recall but Rarely Repair It. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 515–527, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Single-Layer Activation Edits Easily Corrupt Factual Recall but Rarely Repair It (Bugaud, TrustNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.38.pdf