@inproceedings{belz-etal-2025-standard,
title = "Standard Quality Criteria Derived from Current {NLP} Evaluations for Guiding Evaluation Design and Grounding Comparability and {AI} Compliance Assessments",
author = "Belz, Anya and
Mille, Simon and
Thomson, Craig",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1370/",
pages = "26685--26715",
ISBN = "979-8-89176-256-5",
abstract = "Research shows that two evaluation experiments reporting results for the same qualitycriterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality. Notknowing when two evaluations are comparablein this sense means we currently lack the abilityto draw conclusions based on multiple independently conducted evaluations. It is hard to seehow this issue can be fully addressed other thanby the creation of a standard set of quality criterion names and definitions that the evaluationsin use in NLP can be grounded in. Taking a descriptivist approach, the QCET Quality Criteriafor Evaluation Taxonomy derives a standard setof 114 quality criterion names and definitionsfrom three surveys of a combined total of 933evaluation experiments in NLP, and structuresthem into a reference taxonomy. We presentQCET and its uses in (i) establishing comparability of existing evaluations, (ii) guiding thedesign of new evaluations, and (iii) assessingregulation compliance."
}