Blind Single-Layer Activation Edits Show a Break/Fix Asymmetry in Factual Recall

Zacharie Bugaud


Abstract
Can factual errors in language models be repaired by editing a single hidden activation at inference time? We compare blind edits, which are not told the correct answer, with oracle edits that receive answer-specific information. On Pythia-6.9B, with corruption replicated on Pythia-1B and GPT-2 XL, we find a strong break/fix asymmetry: single-layer perturbations easily corrupt correct factual recall, flipping 74-100% of initially correct answers, but blind repair is much harder. On EntityConfusion, twelve blind non-gradient interventions from four families fail to repair stable hallucinations in the strict single-layer setting; relaxed multi-layer or multi-head variants improve net accuracy by only +3 percentage points. Blind gradient optimization repairs more errors, but often breaks already-correct answers. In contrast, oracle edits given the correct answer repair many more hallucinations, fixing 68% at the default layer and up to 82% at a better layer. These results suggest that the main barrier is not whether factual recall can be steered, but whether a blind method can identify the right target-specific direction. TriviaQA is a boundary case: blind confidence maximization outperforms the single-token oracle, but the comparison is complicated because evaluation accepts multiple aliases.
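The abstract reports three kinds of quantity: a break rate (initially correct answers that an edit flips to wrong), a fix rate (initial errors that an edit repairs), and a net accuracy change in percentage points. The sketch below shows one plausible way to compute them from per-example correctness before and after an edit. The function name `break_fix_summary`, its return keys, and the toy data are hypothetical illustrations, not the paper's code or results, and the paper's exact definitions may differ.

```python
from typing import Dict, Sequence


def break_fix_summary(before: Sequence[bool], after: Sequence[bool]) -> Dict[str, float]:
    """Summarize how an edit changed per-example correctness.

    before[i] / after[i] say whether example i was answered correctly
    without / with the activation edit. (Hypothetical helper, assumed
    definitions: rates are conditioned on the pre-edit outcome.)
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")

    n = len(before)
    n_correct = sum(1 for b in before if b)
    n_wrong = n - n_correct

    # Broken: correct before the edit, wrong after it.
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    # Fixed: wrong before the edit, correct after it.
    fixed = sum(1 for b, a in zip(before, after) if (not b) and a)

    return {
        "break_rate": broken / n_correct if n_correct else 0.0,
        "fix_rate": fixed / n_wrong if n_wrong else 0.0,
        # Net accuracy change over all examples, in percentage points.
        "net_accuracy_change_pp": 100.0 * (fixed - broken) / n if n else 0.0,
    }


if __name__ == "__main__":
    # Toy data only: 4 initially correct answers, 6 initial errors.
    before = [True, True, True, True, False, False, False, False, False, False]
    after = [False, False, False, True, True, False, False, False, False, False]
    print(break_fix_summary(before, after))
```

Reporting the two rates separately matters because an edit can repair some errors while breaking correct answers, so a small or negative net change can hide a large amount of movement in both directions.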
Anthology ID:
2026.knowfm-1.2
Volume:
Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Canyu Chen, Yuji Zhang, Zoey Sha Li, Zihan Wang, Qineng Wang, Jinyan Su, Priyanka Kargupta, Sara Vera Marjanović, Jeff Z. Pan, Mohit Bansal, Isabelle Augenstein, Jiawei Han, Heng Ji, Manling Li
Venues:
KnowFM | WS
Publisher:
Association for Computational Linguistics
Pages:
13–24
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.knowfm-1.2/
Cite (ACL):
Zacharie Bugaud. 2026. Blind Single-Layer Activation Edits Show a Break/Fix Asymmetry in Factual Recall. In Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026), pages 13–24, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Blind Single-Layer Activation Edits Show a Break/Fix Asymmetry in Factual Recall (Bugaud, KnowFM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.knowfm-1.2.pdf