How reproducible is best-worst scaling for human evaluation? A reproduction of ‘Data-to-text Generation with Macro Planning’

Emiel van Miltenburg; Anouck Braggaar; Nadine Braun; Debby Damen; Martijn Goudbeek; Chris van der Lee; Frédéric Tomas; Emiel Krahmer

Abstract

This paper is part of the larger ReproHum project, in which different teams of researchers aim to reproduce published experiments from the NLP literature. Specifically, ReproHum focuses on the reproducibility of human evaluation studies, in which participants indicate the quality of different outputs of Natural Language Generation (NLG) systems. Such reproductions are necessary because, without them, we do not know how reliable earlier results are. This paper aims to reproduce the second human evaluation study of Puduppully & Lapata (2021), while another lab is attempting to do the same. That experiment uses best-worst scaling to determine the relative performance of different NLG systems. We found that the worst-performing system in the original study is, in our reproduction, the best-performing system across the board. This means that we cannot fully reproduce the original results. We also carry out alternative analyses of the data, and discuss how our results may be combined with those of the other reproduction study, which is being carried out in parallel with this one.

Anthology ID: 2023.humeval-1.7
Volume: Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Anya Belz, Maja Popović, Ehud Reiter, Craig Thomson, João Sedoc
Venues: HumEval | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 75–88
URL: https://aclanthology.org/2023.humeval-1.7
Cite (ACL): Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Debby Damen, Martijn Goudbeek, Chris van der Lee, Frédéric Tomas, and Emiel Krahmer. 2023. How reproducible is best-worst scaling for human evaluation? A reproduction of ‘Data-to-text Generation with Macro Planning’. In Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems, pages 75–88, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): How reproducible is best-worst scaling for human evaluation? A reproduction of ‘Data-to-text Generation with Macro Planning’ (van Miltenburg et al., HumEval-WS 2023)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2023.humeval-1.7.pdf