Process Standardisation for Human Evaluation of NLP System Outputs

Craig Thomson, Javier González Corbelle, Anya Belz


Abstract
Human evaluation of NLP systems has high knowledge and effort thresholds. Researchers are often expected to design and run evaluations without formal training, while also creating the required resources from scratch. Recent work has started to address the knowledge threshold, but reusable tools that reduce effort remain limited. In this paper, we take a first step toward automated human-evaluation experiment creation by (i) surveying the processes and data resources used in a representative sample of current human evaluations in NLP, and (ii) deriving a canonical process model from these survey results, which (iii) provides a basis for standardised experiment design and automated toolkit development. The survey shows that recent human-evaluation practices are highly aligned in process structure, making reusable automation feasible.
Anthology ID:
2026.gem-main.64
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
704–717
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.64/
Cite (ACL):
Craig Thomson, Javier González Corbelle, and Anya Belz. 2026. Process Standardisation for Human Evaluation of NLP System Outputs. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 704–717, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Process Standardisation for Human Evaluation of NLP System Outputs (Thomson et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.64.pdf