Position: Scores Without Context? Rethinking the Role of Evaluation in the Era of LLMs

Jiawei Zhou


Abstract
Recent years have seen rapid growth in evaluation and benchmarking in NLP, driven by advances in large language models (LLMs). This growth has shifted evaluation from measuring generalization to tracking capability, often without reference to training assumptions. We argue that this creates a conceptual gap: results are frequently interpreted without considering what models could plausibly have learned, rendering many conclusions scientifically underdetermined. We propose an expectation-aware view, where the informativeness of evaluation depends on its relationship to training data, model design, and tasks. We further distinguish between evaluation for scientific understanding and capability tracking, and provide recommendations for aligning evaluation with its intended purpose in the LLM era.
Anthology ID:
2026.gem-main.82
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
1048–1054
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.82/
Cite (ACL):
Jiawei Zhou. 2026. Position: Scores Without Context? Rethinking the Role of Evaluation in the Era of LLMs. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1048–1054, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Position: Scores Without Context? Rethinking the Role of Evaluation in the Era of LLMs (Zhou, GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.82.pdf