Abstract
Human evaluation is widely considered the most reliable form of evaluation in NLP, but recent research has shown it to be riddled with mistakes, often as a result of manual execution of tasks. This paper argues that such mistakes could be avoided if we were to automate, as much as is practical, the process of performing experiments for human evaluation of NLP systems. We provide a simple methodology that can improve both the transparency and reproducibility of experiments. We show how the sequence of component processes of a human evaluation can be defined in advance, facilitating full or partial automation, detailed preregistration of the process, and research transparency and repeatability.
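The idea of defining the component processes of an evaluation in advance lends itself to a simple pipeline abstraction. The following is a minimal illustrative sketch, not the paper's implementation: the names (Step, run_experiment, the example steps) are hypothetical, and the sketch only shows how automated and manual steps might be declared in a fixed, preregisterable order and then executed.

```python
"""Illustrative sketch only: a human-evaluation experiment declared in advance
as an ordered sequence of component processes, where automatable steps run
programmatically and manual steps are logged explicitly. All names here are
hypothetical, not taken from the paper."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    automated: bool                 # True if the step needs no manual work
    action: Callable[[dict], dict]  # takes and returns the experiment state


def sample_outputs(state: dict) -> dict:
    # Automated step: deterministically select system outputs for annotation.
    state["items"] = state["all_outputs"][: state["sample_size"]]
    return state


def collect_ratings(state: dict) -> dict:
    # Manual step: annotators rate the sampled items via an external interface;
    # here we only record that the step was reached.
    state["ratings_collected"] = True
    return state


def run_experiment(steps: List[Step], state: dict) -> dict:
    """Execute each step in the preregistered order, logging what ran."""
    for step in steps:
        mode = "auto" if step.automated else "manual"
        print(f"[{mode}] {step.name}")
        state = step.action(state)
    return state


if __name__ == "__main__":
    pipeline = [
        Step("sample system outputs", automated=True, action=sample_outputs),
        Step("collect human ratings", automated=False, action=collect_ratings),
    ]
    initial = {"all_outputs": [f"output {i}" for i in range(100)], "sample_size": 10}
    run_experiment(pipeline, initial)
```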
- Anthology ID:
- 2024.inlg-main.22
- Volume:
- Proceedings of the 17th International Natural Language Generation Conference
- Month:
- September
- Year:
- 2024
- Address:
- Tokyo, Japan
- Editors:
- Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
- Venue:
- INLG
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 272–279
- URL:
- https://aclanthology.org/2024.inlg-main.22
- Cite (ACL):
- Craig Thomson and Anya Belz. 2024. (Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems. In Proceedings of the 17th International Natural Language Generation Conference, pages 272–279, Tokyo, Japan. Association for Computational Linguistics.
- Cite (Informal):
- (Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems (Thomson & Belz, INLG 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.inlg-main.22.pdf