Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System

Rudali Huidrom; Ondřej Dušek; Zdeněk Kasner; Thiago Castro Ferreira; Anya Belz

Abstract

In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are available and open-sourced.

Anthology ID: 2022.inlg-genchal.9
Volume: Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Month: July
Year: 2022
Address: Waterville, Maine, USA and virtual meeting
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 52–61
URL: https://aclanthology.org/2022.inlg-genchal.9
Cite (ACL): Rudali Huidrom, Ondřej Dušek, Zdeněk Kasner, Thiago Castro Ferreira, and Anya Belz. 2022. Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System. In Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges, pages 52–61, Waterville, Maine, USA and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System (Huidrom et al., INLG 2022)
PDF: https://preview.aclanthology.org/ingestion-script-update/2022.inlg-genchal.9.pdf
