Graduating the Benchmark Scale: Lessons from Thermometry

Sean Trott, Oisín Parkinson-Coombs


Abstract
Benchmarks for assessing large language model (LLM) capabilities have been criticized for a lack of construct validity. Here, we focus on an often overlooked dimension of a benchmark’s validity: namely, the functional mapping between a benchmark’s numerical score and the underlying quantity the benchmark purports to measure. What licenses the assumption that equivalent intervals on a scale correspond to equivalent differences in the underlying capability? We argue that this question is not merely theoretical: the form of this mapping (e.g., linear vs. logarithmic vs. exponential) could and should influence decisions about deployment and regulatory policy. Drawing on work from the history and philosophy of science, we discuss an analogous problem in the early history of thermometry termed the problem of nomic measurement, as well as the epistemic practices that enabled scientists to overcome these challenges. We then ask whether a similar process of epistemic iteration can overcome this problem in benchmarking. Despite clear differences between temperature and “capabilities” as constructs, we argue that some modest success could be achievable in the domain of benchmarking—but that this depends crucially on the clear articulation of a researcher’s goals and theoretical commitments.
Anthology ID:
2026.evaleval-1.21
Volume:
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Month:
July
Year:
2026
Address:
San Diego, CA
Editors:
Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
Venues:
EvalEval | WS
Publisher:
Association for Computational Linguistics
Pages:
111–115
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.21/
Cite (ACL):
Sean Trott and Oisín Parkinson-Coombs. 2026. Graduating the Benchmark Scale: Lessons from Thermometry. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 111–115, San Diego, CA. Association for Computational Linguistics.
Cite (Informal):
Graduating the Benchmark Scale: Lessons from Thermometry (Trott & Parkinson-Coombs, EvalEval 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.21.pdf