Evaluating Automatic Metrics with Incremental Machine Translation Systems

Guojun Wu, Shay B. Cohen, Rico Sennrich


Abstract
We introduce a dataset of commercial machine translations, collected weekly over six years across 12 translation directions. Because commercial providers routinely validate updates through human A/B testing, we assume their systems improve over time, which lets us evaluate machine translation (MT) metrics by how strongly they prefer more recent translations. Our study confirms several prior findings, such as the advantage of neural metrics over non-neural ones, and also examines the debated question of how MT quality affects metric reliability, which smaller datasets in previous research could not sufficiently explore. Overall, our results demonstrate the dataset's value as a testbed for metric evaluation. We release our code.
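The abstract does not spell out the paper's exact scoring protocol, but the core idea, scoring a metric by how often it prefers the chronologically later of two translations, can be illustrated with a minimal sketch. The function name `preference_accuracy` and the pairwise-comparison setup below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the paper's exact protocol): score an MT metric by
# how often it prefers the chronologically later of two translations,
# under the assumption that commercial systems improve over time.
from itertools import combinations

def preference_accuracy(scores):
    """scores: metric scores for translations of the same source text,
    ordered from oldest to newest. Returns the fraction of
    (earlier, later) pairs where the later translation scores higher."""
    pairs = list(combinations(scores, 2))
    if not pairs:
        return 0.0
    wins = sum(1 for earlier, later in pairs if later > earlier)
    return wins / len(pairs)

# Example: a metric that mostly agrees with the improvement assumption.
weekly_scores = [0.61, 0.63, 0.62, 0.68, 0.71]
print(f"preference accuracy: {preference_accuracy(weekly_scores):.2f}")  # 0.90
```

A metric whose scores rise in step with system updates would approach 1.0 under this scheme; a metric insensitive to the (assumed) quality improvements would hover near 0.5.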
Anthology ID:
2024.findings-emnlp.169
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2994–3005
URL:
https://aclanthology.org/2024.findings-emnlp.169/
DOI:
10.18653/v1/2024.findings-emnlp.169
Cite (ACL):
Guojun Wu, Shay B. Cohen, and Rico Sennrich. 2024. Evaluating Automatic Metrics with Incremental Machine Translation Systems. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2994–3005, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Evaluating Automatic Metrics with Incremental Machine Translation Systems (Wu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.169.pdf