Measuring and Mitigating Shortcut Reliance in Language Models with Probe-Based Representation Entanglement

Divyajot Singh

Abstract

Shortcut learning remains a major obstacle to robust NLP systems: models can achieve high in-distribution accuracy by relying on surface cues that fail under distribution shift. We study whether shortcut reliance can be diagnosed and mitigated in small instruction-tuned language models using a simple representation-level quantity. We fine-tune Gemma 3 1B Instruct and Llama 3.2 1B on two synthetic sentiment shortcuts in SST-2 and one natural shortcut in MNLI based on lexical overlap. During training, we fit linear probes for the task label and the shortcut attribute at every layer and define CDRE as the absolute cosine similarity between the two probe directions. Across settings, increasing shortcut prevalence produces a sharp rise in the robustness gap between shortcut-aligned and shortcut-free test sets, and higher deep-layer CDRE tracks this degradation. At a 99% shortcut ratio, Llama’s clean accuracy on capitalization-biased SST-2 drops from 93.2% at 0% bias to 49.0%, while Gemma drops from 91.8% to 60.2%. A CDRE-regularized objective substantially improves robustness for capitalization and lexical-overlap shortcuts, but offers little benefit for a speaker-prefix shortcut whose learned directions are already nearly orthogonal. These results show that probe-derived representation entanglement provides a reliable signal of harmful shortcut reliance and offers a practical criterion for determining when shortcut mitigation is likely to be effective.
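The CDRE quantity described in the abstract, the absolute cosine similarity between the task-label probe direction and the shortcut-attribute probe direction, can be illustrated with a minimal sketch. The function name `cdre`, the `eps` guard, and the plain-list representation of probe weights are assumptions made for illustration; the abstract does not specify how the probes are fit or how the directions are extracted, so this is not the paper's implementation.

```python
import math

def cdre(w_task, w_shortcut, eps=1e-12):
    """Absolute cosine similarity between two probe weight vectors.

    w_task and w_shortcut are hypothetical names for the weight vectors of
    linear probes fit at one layer (task label and shortcut attribute).
    """
    dot = sum(a * b for a, b in zip(w_task, w_shortcut))
    norm_task = math.sqrt(sum(a * a for a in w_task))
    norm_shortcut = math.sqrt(sum(b * b for b in w_shortcut))
    # eps avoids division by zero if a probe collapses to the zero vector.
    return abs(dot) / max(norm_task * norm_shortcut, eps)

# Orthogonal directions give a value near 0; parallel or anti-parallel
# directions give a value near 1, because the sign is discarded.
orthogonal = cdre([1.0, 0.0], [0.0, 1.0])
anti_parallel = cdre([1.0, 2.0], [-2.0, -4.0])
```

Under this reading, a value near 0 corresponds to the nearly orthogonal directions the abstract reports for the speaker-prefix shortcut, and a value near 1 to strongly entangled task and shortcut directions.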

Anthology ID: 2026.acl-srw.59
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 648–662
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.59/
Cite (ACL): Divyajot Singh. 2026. Measuring and Mitigating Shortcut Reliance in Language Models with Probe-Based Representation Entanglement. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 648–662, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Measuring and Mitigating Shortcut Reliance in Language Models with Probe-Based Representation Entanglement (Singh, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.59.pdf
