Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets

Tom Kocmi; Ekaterina Artemova; Eleftherios Avramidis; Rachel Bawden; Ondřej Bojar; Konstantin Dranch; Anton Dvorkovich; Sergey Dukanov; Mark Fishel; Markus Freitag; Thamme Gowda; Roman Grundkiewicz; Barry Haddow; Marzena Karpinska; Philipp Koehn; Howard Lakougna; Jessica Lundin; Christof Monz; Kenton Murray; Masaaki Nagata; Stefano Perrella; Lorenzo Proietti; Martin Popel; Maja Popović; Parker Riley; Mariya Shmatova; Steinþór Steingrímsson; Lisa Yankovskaya; Vilém Zouhar

Abstract

This paper presents the results of the General Machine Translation Task organized as part of the 2025 Conference on Machine Translation (WMT). Participants were invited to build systems for any of 30 language pairs. For half of these pairs, we conducted a human evaluation on test sets spanning four to five different domains. We evaluated 60 systems in total: 36 submitted by participants and 24 for which we collected translations from large language models (LLMs) and popular online translation providers. This year, we focused on creating challenging test sets by developing a difficulty sampling technique and using more complex source data. We evaluated system outputs with professional annotators using the Error Span Annotation (ESA) protocol, except for two language pairs, for which we used Multidimensional Quality Metrics (MQM) instead. We continued the trend of increasingly moving towards document-level translation, providing the source texts as whole documents containing multiple paragraphs.

Anthology ID: 2025.wmt-1.22
Volume: Proceedings of the Tenth Conference on Machine Translation
Month: November
Year: 2025
Address: Suzhou, China
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 355–413
URL: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.22/
Cite (ACL): Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo Proietti, Martin Popel, Maja Popović, Parker Riley, Mariya Shmatova, Steinthór Steingrímsson, Lisa Yankovskaya, and Vilém Zouhar. 2025. Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets. In Proceedings of the Tenth Conference on Machine Translation, pages 355–413, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets (Kocmi et al., WMT 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.22.pdf
