GEMBA V2: Ten Judgments Are Better Than One

Marcin Junczys-Dowmunt


Abstract
We introduce GEMBA-MQM V2, an MQM-inspired, reference-free LLM evaluation metric for the WMT25 Metrics Shared Task (Subtask 1). Building on GEMBA/GEMBA-MQM, we prompt GPT-4.1-mini to produce structured MQM error annotations per segment. We map annotations to scores with 25/5/1 severity weights (minor punctuation = 0.1). To reduce stochastic variance, each segment is scored ten times and aggregated with a reciprocal-rank weighted average (RRWA) after removing outliers beyond 2σ. On the WMT24 MQM test sets, GEMBA-MQM V2 ranks first by average correlation, with strong results across languages and evaluation levels; WMT23 results show comparable performance.
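The sketch below illustrates the scoring scheme described in the abstract: error annotations are mapped to penalties with 25/5/1 severity weights (0.1 for minor punctuation), and ten per-segment judgments are aggregated with a reciprocal-rank weighted average after discarding scores more than 2σ from the mean. The error representation, the sign convention (negated penalty sum), and the ranking direction (highest score gets rank 1) are assumptions for illustration, not details confirmed by the abstract; all names are hypothetical.

```python
import statistics

# Assumed severity weights from the abstract: critical/major/minor = 25/5/1.
SEVERITY_WEIGHTS = {"critical": 25.0, "major": 5.0, "minor": 1.0}

def segment_score(errors):
    """Map (severity, category) annotations to a penalty-based segment score."""
    penalty = 0.0
    for severity, category in errors:
        weight = SEVERITY_WEIGHTS[severity]
        if severity == "minor" and category == "punctuation":
            weight = 0.1  # minor punctuation errors are down-weighted
        penalty += weight
    return -penalty  # assumption: higher (closer to 0) is better

def aggregate_rrwa(scores, sigma_cutoff=2.0):
    """Drop judgments beyond 2 sigma, then reciprocal-rank weighted average."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    kept = [s for s in scores if abs(s - mean) <= sigma_cutoff * std]
    ranked = sorted(kept, reverse=True)  # assumption: rank 1 = best score
    weights = [1.0 / (i + 1) for i in range(len(ranked))]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)

# Example: ten judgments for one segment; the -25.0 outlier is removed.
judgments = [-1.0, -1.1, -0.1, -1.0, -6.0, -1.1, -0.1, -1.0, -1.1, -25.0]
print(aggregate_rrwa(judgments))
```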
Anthology ID:
2025.wmt-1.67
Volume:
Proceedings of the Tenth Conference on Machine Translation
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
926–933
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.67/
Cite (ACL):
Marcin Junczys-Dowmunt. 2025. GEMBA V2: Ten Judgments Are Better Than One. In Proceedings of the Tenth Conference on Machine Translation, pages 926–933, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
GEMBA V2: Ten Judgments Are Better Than One (Junczys-Dowmunt, WMT 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.67.pdf