Mapping Out the NLP Evaluation Landscape with a Standard Taxonomy of Quality Criteria

Anya Belz, Simon Mille, Craig Thomson


Abstract
Prior research shows that when papers report results from system evaluations in terms of a quality criterion such as Fluency, answers to two questions are normally less clear than they should be: (i) was it really Fluency that was evaluated; and (ii) was the same aspect of quality evaluated as in other evaluations also claiming to evaluate Fluency. Answers to these questions are crucial if meaningful conclusions about the Fluency of systems, independently and as compared to others, are to be drawn. We map a combined total of 1,002 individual evaluations identified in three surveys of 310 NLP papers to the standardised QCET inventory of quality criterion names and definitions. Standardisation results in up to 76% reduction in evaluation criteria names, revealing a lot of spurious difference in evaluation naming. We argue that conclusions drawn from NLP system evaluations are only fully interpretable and comparable if grounding in a standard inventory of quality criterion names and definitions forms part of experiment design and reporting, and we propose a way of achieving this.
Anthology ID:
2026.gem-main.77
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
999–1014
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.77/
Cite (ACL):
Anya Belz, Simon Mille, and Craig Thomson. 2026. Mapping Out the NLP Evaluation Landscape with a Standard Taxonomy of Quality Criteria. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 999–1014, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Mapping Out the NLP Evaluation Landscape with a Standard Taxonomy of Quality Criteria (Belz et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.77.pdf