Agreement is overrated: A plea for correlation to assess human evaluation reliability

Jacopo Amidei; Paul Piwek; Alistair Willis

doi:10.18653/v1/W19-8642

Abstract

Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular, its reliability. According to existing scales of IAA interpretation – see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a) – most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human evaluation study). Following Sampson and Babarczy (2008), Lommel et al. (2014), Joshi et al. (2016) and Amidei et al. (2018b), such phenomena can be explained in terms of irreducible human language variability. Using three case studies, we show the limits of considering IAA as the only criterion for checking evaluation reliability. Given human language variability, we propose that for human evaluation of NLG, correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the evaluation data reliability. This is illustrated using the three case studies.
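
To make the contrast between agreement and correlation concrete, here is a minimal sketch in which two raters rank items identically but one scores consistently one point higher. The ratings are invented for illustration and are not taken from the paper's case studies. The abstract does not name specific coefficients in this passage, so the choice of Cohen's kappa and Pearson's r is an assumption.

```python
from collections import Counter
from math import sqrt

# Hypothetical ratings on a 1-5 scale (illustrative only, not from the paper).
# Rater B ranks the items exactly as rater A does, but is one point more generous.
rater_a = [1, 2, 3, 4, 1, 2, 3, 4]
rater_b = [2, 3, 4, 5, 2, 3, 4, 5]


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    n = len(a)
    # Observed agreement: share of items given the identical label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def pearson_r(a, b):
    """Pearson correlation coefficient between two score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


kappa = cohen_kappa(rater_a, rater_b)
r = pearson_r(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")
print(f"Pearson's r:   {r:.3f}")
```

On this toy data the raters never assign the same label, so kappa falls below zero, while the correlation is perfect because their scores move together. Reading only the agreement figure would suggest unreliable data; reading both coefficients shows a consistent ordering with a systematic offset.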

Anthology ID: W19-8642
Volume: Proceedings of the 12th International Conference on Natural Language Generation
Month: October–November
Year: 2019
Address: Tokyo, Japan
Editors: Kees van Deemter, Chenghua Lin, Hiroya Takamura
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 344–354
URL: https://preview.aclanthology.org/nschneid-patch-2/W19-8642/
DOI: 10.18653/v1/W19-8642
Cite (ACL): Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019. Agreement is overrated: A plea for correlation to assess human evaluation reliability. In Proceedings of the 12th International Conference on Natural Language Generation, pages 344–354, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal): Agreement is overrated: A plea for correlation to assess human evaluation reliability (Amidei et al., INLG 2019)
PDF: https://preview.aclanthology.org/nschneid-patch-2/W19-8642.pdf