An Empirical Analysis of Machine Translation for Expanding Multilingual Benchmarks

Sara Rajaee, Rochelle Choenni, Ekaterina Shutova, Christof Monz


Abstract
The rapid advancement of large language models (LLMs) has introduced new challenges in their evaluation, particularly for multilingual settings. The shortage of evaluation data is most pronounced in low-resource languages due to the scarcity of professional annotators, hindering fair progress across languages. In this work, we systematically investigate the viability of using machine translation (MT) as a proxy for evaluation in scenarios where human-annotated test sets are unavailable. Leveraging a state-of-the-art translation model, we translate datasets from four tasks into 198 languages and employ these translations to assess the quality and robustness of MT-based multilingual evaluation under different setups. We analyze task-specific error patterns, identifying when MT-based evaluation is reliable and when it produces misleading results. Our translated benchmark reveals that current language selections in multilingual datasets tend to overestimate LLM performance on low-resource languages. We conclude that although machine translation is not yet a fully reliable method for evaluating multilingual models, overlooking its potential means missing a valuable opportunity to track progress in non-English languages.
Anthology ID:
2025.wmt-1.1
Volume:
Proceedings of the Tenth Conference on Machine Translation
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
1–30
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.1/
Cite (ACL):
Sara Rajaee, Rochelle Choenni, Ekaterina Shutova, and Christof Monz. 2025. An Empirical Analysis of Machine Translation for Expanding Multilingual Benchmarks. In Proceedings of the Tenth Conference on Machine Translation, pages 1–30, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
An Empirical Analysis of Machine Translation for Expanding Multilingual Benchmarks (Rajaee et al., WMT 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.1.pdf