GEMBA V2: Ten Judgments Are Better Than One

Marcin Junczys-Dowmunt


Abstract
We introduce GEMBA-MQM V2, an MQM-inspired, reference-free LLM evaluation metric for the WMT25 Metrics Shared Task (Subtask 1). Building on GEMBA/GEMBA-MQM, we prompt GPT-4.1-mini to produce structured MQM error annotations per segment. We map annotations to scores with 25/5/1 severity weights (minor punctuation = 0.1). To reduce stochastic variance, each segment is scored ten times and aggregated with a reciprocal-rank weighted average (RRWA) after removing outliers beyond 2σ. On the WMT24 MQM test sets, GEMBA-MQM V2 ranks first by average correlation, with strong results across languages and evaluation levels; WMT23 results show comparable performance.
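The sketch below illustrates the scoring scheme described in the abstract: error annotations are mapped to penalties with 25/5/1 severity weights (0.1 for minor punctuation), and ten per-segment judgments are aggregated with a reciprocal-rank weighted average after discarding scores more than 2σ from the mean. The error representation, the sign convention (negated penalty sum), and the ranking direction (highest score gets rank 1) are assumptions for illustration, not details confirmed by the abstract; all names are hypothetical.

```python
import statistics

# Assumed severity weights from the abstract: critical/major/minor = 25/5/1.
SEVERITY_WEIGHTS = {"critical": 25.0, "major": 5.0, "minor": 1.0}

def segment_score(errors):
    """Map (severity, category) annotations to a penalty-based segment score."""
    penalty = 0.0
    for severity, category in errors:
        weight = SEVERITY_WEIGHTS[severity]
        if severity == "minor" and category == "punctuation":
            weight = 0.1  # minor punctuation errors are down-weighted
        penalty += weight
    return -penalty  # assumption: higher (closer to 0) is better

def aggregate_rrwa(scores, sigma_cutoff=2.0):
    """Drop judgments beyond 2 sigma, then reciprocal-rank weighted average."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    kept = [s for s in scores if abs(s - mean) <= sigma_cutoff * std]
    ranked = sorted(kept, reverse=True)  # assumption: rank 1 = best score
    weights = [1.0 / (i + 1) for i in range(len(ranked))]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)

# Example: ten judgments for one segment; the -25.0 outlier is removed.
judgments = [-1.0, -1.1, -0.1, -1.0, -6.0, -1.1, -0.1, -1.0, -1.1, -25.0]
print(aggregate_rrwa(judgments))
```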
Anthology ID:
2025.wmt-1.67
Volume:
Proceedings of the Tenth Conference on Machine Translation
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
926–933
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.67/
Cite (ACL):
Marcin Junczys-Dowmunt. 2025. GEMBA V2: Ten Judgments Are Better Than One. In Proceedings of the Tenth Conference on Machine Translation, pages 926–933, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
GEMBA V2: Ten Judgments Are Better Than One (Junczys-Dowmunt, WMT 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.67.pdf