Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

Xanh Ho; Jiahao Huang; Florian Boudin; Akiko Aizawa

Abstract

Extractive QA tasks are commonly evaluated using Exact Match (EM) and F1-score, but these metrics often fail to reflect true model performance. Recent studies have proposed using large language models (LLMs) as judges (LLM-as-a-judge), yet they often lack comprehensive evaluation across datasets and overlook key factors such as sensitivity to answer types, prompt variations, and self-preference bias. In this work, we conduct a systematic study of LLM-as-a-judge across four extractive QA datasets and various prompt variations, assessing multiple LLM families in both answering and judging roles. Our results show that LLM-as-a-judge judgments correlate much more strongly with human evaluations than EM (0.22) and F1 (0.40), achieving correlations up to 0.85 with open-source models. Further analysis reveals that LLM-as-a-judge performs particularly well on number-related answers but faces challenges with more complex types, such as job titles. Contrary to findings in other NLP tasks, we observe no self-preference bias, even when the same model serves as both QA model and judge. Finally, we find that prompt phrasing has minimal impact, and zero-shot, context-free judging often yields the best evaluation performance.
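
To make the limitation of EM and F1 concrete, the following is a minimal sketch of the two metrics as they are commonly defined for extractive QA (SQuAD-style normalization and token overlap). The abstract does not state the exact normalization the authors use, so the `normalize` rules here are an assumption, and the example answers are hypothetical rather than drawn from the paper's datasets.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the widely used SQuAD-style normalization,
    not necessarily the exact rules used in the paper.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cases: both predictions would likely be accepted by a human.
print(exact_match("President Barack Obama", "Barack Obama"))  # extra title word
print(f1_score("President Barack Obama", "Barack Obama"))
print(f1_score("eight", "8"))  # same quantity, no shared tokens
```

In this sketch, a prediction with one extra but harmless word gets no EM credit and only partial F1, while a numerically equivalent answer written as a word shares no tokens with the gold string. Cases like these illustrate how string-overlap metrics can diverge from human judgments.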

Anthology ID: 2026.gem-main.9
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 84–101
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.9/
Cite (ACL): Xanh Ho, Jiahao Huang, Florian Boudin, and Akiko Aizawa. 2026. Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 84–101, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses (Ho et al., GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.9.pdf
