Abstract
While many new automatic metrics for machine translation evaluation have been proposed in recent years, BLEU scores are still used as the primary metric in the vast majority of MT research papers. There are many reasons that researchers may be reluctant to switch to new metrics, from external pressures (reviewers, prior work) to the ease of use of metric toolkits. Another reason is a lack of intuition about the meaning of novel metric scores. In this work, we examine “rules of thumb” about metric score differences and how they do (and do not) correspond to human judgments of statistically significant differences between systems. In particular, we show that common rules of thumb about BLEU score differences do not in fact guarantee that human annotators will find significant differences between systems. We also show ways in which these rules of thumb fail to generalize across translation directions or domains.

- Anthology ID: 2023.mtsummit-research.16
- Volume: Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
- Month: September
- Year: 2023
- Address: Macau SAR, China
- Editors: Masao Utiyama, Rui Wang
- Venue: MTSummit
- Publisher: Asia-Pacific Association for Machine Translation
- Pages: 186–199
- URL: https://aclanthology.org/2023.mtsummit-research.16
- Cite (ACL): Chi-kiu Lo, Rebecca Knowles, and Cyril Goutte. 2023. Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 186–199, Macau SAR, China. Asia-Pacific Association for Machine Translation.
- Cite (Informal): Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics (Lo et al., MTSummit 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2023.mtsummit-research.16.pdf
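The abstract contrasts fixed score-difference thresholds with statistical significance between systems. A standard way to test the latter is paired bootstrap resampling over per-sentence metric scores. The sketch below is a minimal, hypothetical illustration of that general technique (fabricated scores, not the paper's data or exact procedure):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling over per-sentence metric scores.

    Resamples sentence indices with replacement and counts how often
    system A's mean score beats system B's. One minus the returned
    win rate approximates a p-value for "A outperforms B".
    Hypothetical sketch for illustration only.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw a resampled test set (same indices for both systems).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples

# Fabricated per-sentence scores for two systems, for illustration.
rng = random.Random(42)
sys_a = [rng.gauss(0.55, 0.1) for _ in range(200)]
sys_b = [rng.gauss(0.50, 0.1) for _ in range(200)]
print(paired_bootstrap(sys_a, sys_b))
```

Because the resampling is paired (the same sentences are drawn for both systems), the test accounts for per-sentence difficulty, which a raw corpus-level score gap ignores.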