Position: Evaluation Scores Are Perishable Knowledge Claims

Sankalp Gilda, Shlok Gilda


Abstract
Evaluation methodologies for language models increasingly combine multiple signals—automated metrics, LLM-as-judge ratings, human assessments, and benchmark suite results. When these signals are aggregated via averaging, the resulting evaluation confidence can substantially exceed the reliability of the weakest signal: a phenomenon we call trust inflation in evaluation. We argue that evaluation scores should be treated as epistemic claims with three properties: formality (human evaluation provides stronger evidence than an automated metric), scope (a benchmark result applies to the tested distribution, not universally), and validity windows (benchmark results expire as contamination accumulates and distributions shift). Drawing on several converging research traditions—chain-of-thought analysis, possibilistic logic, and algebraic theory—that establish weakest-link aggregation as the conservative endpoint of a parameterized operator family controlled by a single pessimism parameter, and on concrete lessons from building an evaluation harness for agentic AI, we propose that evaluation results carry explicit metadata—formality tier, scope declaration, and expiration date—to make their epistemic status transparent. We illustrate the cost of mean aggregation on the public HELM leaderboard: across 54 frontier models on ten scenarios, the top-five models ranked by mean score and by weakest-link are completely disjoint.
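The contrast the abstract draws between mean and weakest-link aggregation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `EvalClaim` fields, the integer formality tiers, the toy scores and dates, and above all the linear blend between mean and minimum, which stands in for the parameterized operator family the abstract mentions and is not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class EvalClaim:
    """One evaluation signal plus the metadata the abstract proposes."""
    signal: str       # e.g. "human assessment", "automated metric"
    score: float      # assumed normalized to [0, 1]
    formality: int    # hypothetical tier: higher means stronger evidence
    scope: str        # distribution the score was measured on
    expires: date     # hypothetical end of the validity window


def live_claims(claims, today):
    """Keep only claims whose validity window has not closed."""
    return [c for c in claims if c.expires >= today]


def aggregate(claims, pessimism):
    """Blend mean (pessimism=0.0) and weakest-link (pessimism=1.0).

    The linear blend is an illustrative assumption, not the operator
    family referenced in the abstract.
    """
    if not claims:
        raise ValueError("no live claims to aggregate")
    if not 0.0 <= pessimism <= 1.0:
        raise ValueError("pessimism must lie in [0, 1]")
    scores = [c.score for c in claims]
    return (1.0 - pessimism) * mean(scores) + pessimism * min(scores)


# Toy data, invented for this sketch.
claims = [
    EvalClaim("automated metric", 0.90, 1, "toy benchmark", date(2026, 12, 31)),
    EvalClaim("LLM-as-judge", 0.80, 2, "toy benchmark", date(2026, 12, 31)),
    EvalClaim("human assessment", 0.40, 3, "toy benchmark", date(2027, 6, 30)),
    EvalClaim("stale benchmark", 0.95, 1, "older test set", date(2025, 1, 1)),
]
live = live_claims(claims, today=date(2026, 7, 1))
mean_score = aggregate(live, pessimism=0.0)   # plain average of live signals
weakest = aggregate(live, pessimism=1.0)      # weakest-link (minimum)
```

In this toy setup the expired claim is dropped before aggregation, and the averaged score sits well above the weakest live signal, which is the gap the abstract calls trust inflation.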
Anthology ID:
2026.gem-main.80
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
1029–1035
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.80/
Cite (ACL):
Sankalp Gilda and Shlok Gilda. 2026. Position: Evaluation Scores Are Perishable Knowledge Claims. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1029–1035, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Position: Evaluation Scores Are Perishable Knowledge Claims (Gilda & Gilda, GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.80.pdf