Abstract
This work examines different ways of aggregating scores for error annotation in MT outputs: raw error counts, error counts normalised over the total number of words (‘word percentage’), and error counts normalised over the total number of errors (‘error percentage’). We use each of these three scores to calculate inter-annotator agreement in the form of Krippendorff’s alpha and Pearson’s r and compare the resulting values, both overall and separately for different types of errors. While each score has its advantages depending on the goal of the evaluation, we argue that raw counts are the best basis for estimating inter-annotator agreement from such numbers. If the annotation process ensures that the total number of words cannot differ between annotators (for example, due to adding omission symbols), normalising over the number of words leads to the same conclusions. In contrast, the total number of errors is highly subjective, because different annotators often perceive different numbers of errors in the same text; normalising over this number can therefore indicate lower agreement. (A minimal worked example of the three scores is sketched after the metadata below.)
- Anthology ID:
- 2022.gem-1.26
- Volume:
- Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates (Hybrid)
- Venue:
- GEM
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 306–315
- URL:
- https://aclanthology.org/2022.gem-1.26
- DOI:
- 10.18653/v1/2022.gem-1.26
- Cite (ACL):
- Maja Popović and Anya Belz. 2022. On reporting scores and agreement for error annotation tasks. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 306–315, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- On reporting scores and agreement for error annotation tasks (Popović & Belz, GEM 2022)
- PDF:
- https://aclanthology.org/2022.gem-1.26.pdf
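
To make the three aggregation schemes from the abstract concrete, here is a minimal Python sketch. The per-segment counts and the array layout are hypothetical illustrations, not the authors’ data or code, and Krippendorff’s alpha is omitted for brevity; only Pearson’s r between two annotators is computed, using numpy.

```python
import numpy as np

# Hypothetical per-segment annotations from two annotators.
# Each row: (errors of one type, total errors, total words) for a segment.
ann_a = np.array([[2, 5, 20], [0, 1, 15], [3, 4, 25]], dtype=float)
ann_b = np.array([[1, 2, 20], [1, 3, 15], [3, 6, 25]], dtype=float)

def scores(ann):
    raw = ann[:, 0]                          # raw error counts
    word_pct = 100 * ann[:, 0] / ann[:, 2]   # normalised over total words
    err_pct = 100 * ann[:, 0] / ann[:, 1]    # normalised over total errors
    return raw, word_pct, err_pct

for name, a, b in zip(("raw counts", "word percentage", "error percentage"),
                      scores(ann_a), scores(ann_b)):
    r = np.corrcoef(a, b)[0, 1]              # Pearson's r between annotators
    print(f"{name}: r = {r:.3f}")
```

Note that the denominator of the ‘error percentage’ score is each annotator’s own total error count, which differs between annotators whenever they perceive different numbers of errors in the same text; as the abstract argues, this is why agreement computed on error percentages can come out lower than agreement on raw counts or word percentages.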