Dynamic Human Evaluation for Relative Model Comparisons

Thórhildur Thorleiksdóttir, Cedric Renggli, Nora Hollenstein, Ce Zhang


Abstract
Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have reported flaws when applied to measure quality aspects of generated text and have been shown to correlate poorly with human judgements. However, human evaluation is time- and cost-intensive, and we lack consensus on designing and conducting human evaluation experiments. Thus, there is a need for streamlined approaches for efficient collection of human judgements when evaluating natural language generation systems. We therefore present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across different labelling strategies, where assigning a single random worker per task requires the least overall labelling effort and thus the least cost.
Anthology ID:
2022.lrec-1.639
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5946–5955
URL:
https://aclanthology.org/2022.lrec-1.639
Cite (ACL):
Thórhildur Thorleiksdóttir, Cedric Renggli, Nora Hollenstein, and Ce Zhang. 2022. Dynamic Human Evaluation for Relative Model Comparisons. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5946–5955, Marseille, France. European Language Resources Association.
Cite (Informal):
Dynamic Human Evaluation for Relative Model Comparisons (Thorleiksdóttir et al., LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.lrec-1.639.pdf
Code:
thorhildurt/dynamic-human-evaluation