Can Out-of-Distribution Evaluations Uncover Reliance on Prediction Shortcuts? A Case Study in Question Answering

Michal Štefánik; Timothee Mickus; Michal Spiegel; Marek Kadlčík; Josef Kuchař

doi:10.18653/v1/2025.findings-emnlp.1232

Abstract

A large body of recent work assesses models’ generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide estimates of models’ robustness to shortcuts that vary vastly in quality, with some largely underperforming even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset’s quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.

Anthology ID: 2025.findings-emnlp.1232
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 22628–22635
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1232/
DOI: 10.18653/v1/2025.findings-emnlp.1232
Cite (ACL): Michal Štefánik, Timothee Mickus, Michal Spiegel, Marek Kadlčík, and Josef Kuchař. 2025. Can Out-of-Distribution Evaluations Uncover Reliance on Prediction Shortcuts? A Case Study in Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22628–22635, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Can Out-of-Distribution Evaluations Uncover Reliance on Prediction Shortcuts? A Case Study in Question Answering (Štefánik et al., Findings 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1232.pdf
Checklist: 2025.findings-emnlp.1232.checklist.pdf
