Position: Toward a Metric Typology for Language Model Evaluation

Jasper Kyle Catapang


Abstract
The critique of scalar benchmark rankings as proxies for model quality is now well established (Raji et al., 2021; Wallach et al., 2025; Bean et al., 2025; Gehrmann et al., 2021). What the field still lacks is a shared structural vocabulary for comparing, combining, and contextualizing metric design choices. This paper provides that vocabulary: a typology of four primitives, representation (𝜙), comparison (D), aggregation (A), and context (C), under which existing metrics (BLEU, BERTScore, nDCG, LLM-as-judge, calibration scores, agentic outcome measures) are explicit parameterizations of a common form. The typology is paired with a measurement–decision split: metrics are noisy estimators of latent constructs, and model selection is context-dependent Pareto optimization over construct estimates, not over raw scores. Together, these make implicit metric assumptions comparable and debatable rather than hidden inside a single number.
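As a rough illustration of the two ideas in the abstract, the sketch below composes a metric from the four primitives and then selects models by Pareto dominance. It is not the paper's formalization: the abstract does not spell out the common form, so the composition order, the treatment of context C as a filter over evaluation items, the helper names (`make_metric`, `pareto_front`), the toy unigram-precision metric, and the numeric construct estimates are all assumptions made here for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (hypothesis, reference)


def make_metric(
    phi: Callable[[str], list],               # representation
    D: Callable[[list, list], float],         # comparison
    A: Callable[[Sequence[float]], float],    # aggregation
    C: Callable[[Pair], bool] = lambda pair: True,  # context, modeled as an item filter (assumption)
) -> Callable[[Sequence[Pair]], float]:
    """Build a corpus-level metric as A over D(phi(h), phi(r)) for items admitted by C."""
    def metric(pairs: Sequence[Pair]) -> float:
        scores = [D(phi(h), phi(r)) for (h, r) in pairs if C((h, r))]
        return A(scores)
    return metric


def tokens(text: str) -> list:
    return text.lower().split()


def unigram_precision(hyp: list, ref: list) -> float:
    # Fraction of hypothesis tokens that also occur in the reference.
    if not hyp:
        return 0.0
    ref_set = set(ref)
    return sum(tok in ref_set for tok in hyp) / len(hyp)


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def pareto_front(estimates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the models not dominated on every construct (higher is better)."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return sorted(
        name for name, v in estimates.items()
        if not any(dominates(w, v) for other, w in estimates.items() if other != name)
    )


# A toy overlap metric: one parameterization of (phi, D, A, C).
toy_overlap = make_metric(phi=tokens, D=unigram_precision, A=mean)
pairs = [("the cat sat", "the cat sat down"), ("a dog", "the cat")]
toy_score = toy_overlap(pairs)

# Placeholder construct estimates per model, e.g. (fluency, faithfulness).
front = pareto_front({"m1": (0.9, 0.2), "m2": (0.5, 0.5), "m3": (0.4, 0.4)})
```

Swapping any one argument of `make_metric` (for example, an embedding function for `phi`, or a context filter for `C`) yields a different metric while keeping the other design choices visible, which is the kind of comparison the typology is meant to support. In this example `m3` drops out of the front because `m2` is at least as good on both constructs, while `m1` and `m2` remain as a trade-off to be resolved by the deployment context.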
Anthology ID:
2026.gem-main.78
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
1015–1020
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.78/
Cite (ACL):
Jasper Kyle Catapang. 2026. Position: Toward a Metric Typology for Language Model Evaluation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1015–1020, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Position: Toward a Metric Typology for Language Model Evaluation (Catapang, GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.78.pdf