Christian Giannetti
2026
The Mechanics of Interference: Defusing Distractors in RAG via Sparse Autoencoder Interventions
Christian Giannetti | Giovanni Trappolini | Nicola Tonellotto | Fabrizio Silvestri | Pietro Lio
Findings of the Association for Computational Linguistics: ACL 2026
Large language models exhibit a critical vulnerability to distractor interference in retrieval-augmented contexts: they fail to prioritize relevant, factually correct documents over topically similar but misleading content. We introduce Lat-Defuse, a mechanistic framework that corrects this failure mode through targeted interventions in the model's latent space. Using Sparse Autoencoders (SAEs), our method operates in an interpretable feature space and formulates correction as a constrained counterfactual optimization problem. On the Gemma-2 and Llama-3 model families, across three QA benchmarks (BioASQ, Natural Questions, PopQA), our method achieves recovery rates of up to 94% on distractor-vulnerable samples. That sparse modifications suffice for correction indicates that distractor interference is a localized, systematically addressable phenomenon, and it opens directions toward universal distractor robustness in LLMs.
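The idea of correcting a model through a sparse edit in SAE feature space can be made concrete with a toy Python sketch. This is not Lat-Defuse. Every element of it is a hypothetical choice for illustration: the tiny hand-picked weights, the linear readout direction `u` (correct-answer logit minus distractor logit), the hinge-plus-L1 objective, the nonnegativity constraint, and the proximal-gradient (ISTA-style) solver. The sketch encodes an activation with an SAE and then searches for a sparse perturbation `delta` of the latent features that pushes the logit margin from the distractor toward the correct answer.

```python
import math

# Toy sketch only: weights, objective, and solver are hypothetical,
# not the Lat-Defuse formulation.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def encode(x, W_enc, b_enc):
    # SAE encoder: z = ReLU(W_enc x + b_enc), one nonnegative activation per feature
    return relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])

def margin(z, W_dec, u):
    # Decode latents back to the residual stream, then read out
    # logit(correct) - logit(distractor) along the hypothetical direction u
    x_hat = matvec(W_dec, z)
    return sum(a * b for a, b in zip(u, x_hat))

def sparse_intervention(z, W_dec, u, target=1.0, lam=0.05, lr=0.1, steps=200):
    m = len(z)
    # Gradient of the (linear) margin with respect to each latent feature
    g = [sum(u[i] * W_dec[i][j] for i in range(len(u))) for j in range(m)]
    delta = [0.0] * m
    for _ in range(steps):
        edited = [a + b for a, b in zip(z, delta)]
        if margin(edited, W_dec, u) < target:
            # Hinge loss max(0, target - margin) is active: take a gradient step
            step = [d + lr * gj for d, gj in zip(delta, g)]
        else:
            step = list(delta)
        # Proximal step for the L1 penalty (soft-thresholding) keeps delta sparse
        delta = [math.copysign(max(abs(s) - lr * lam, 0.0), s) for s in step]
        # Constraint: edited SAE activations must stay nonnegative
        delta = [max(d, -zj) for d, zj in zip(delta, z)]
    return delta

# Hypothetical 2-d residual stream with a 3-feature SAE
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -0.5]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
u = [1.0, -1.0]  # correct minus distractor readout (assumed)

x = [0.2, 1.0]
z = encode(x, W_enc, b_enc)
before = margin(z, W_dec, u)  # negative: the distractor wins
delta = sparse_intervention(z, W_dec, u)
after = margin([a + b for a, b in zip(z, delta)], W_dec, u)
edited_features = [j for j, d in enumerate(delta) if d != 0.0]
print(f"margin {before:.2f} -> {after:.2f}; edited features {edited_features}")
```

The L1 proximal step zeroes any update too small to pay for itself. Here feature 2 has zero gradient along `u`, so it stays untouched, and only features 0 and 1 are edited. In this sense, a sparse latent edit remains easy to inspect feature by feature.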