RUBRIC-MQM : Span-Level LLM-as-judge in Machine Translation For High-End Models

Ahrii Kim


Abstract
Referred to as LLM-as-judge, generative large language models (LLMs) have demonstrated considerable efficacy as evaluators in various tasks, including Machine Translation (LAJ-MT), by predicting scores or identifying error types for individual sentences. However, their dependability in practical applications has yet to be demonstrated, since the open-ended nature of the task permits only approximate matches against reference annotations. To address this problem, we introduce PromptCUE, a straightforward and novel meta-evaluation strategy, and use it to evaluate cutting-edge LAJ-MT models such as GEMBA-MQM. We identify their fundamental deficits, including certain label biases and an inability to assess near-perfect translations. To improve reliability, we investigate more trustworthy and less biased models through multidimensional prompt engineering. Our findings indicate that combining span-level error quantification with a rubric-style prompt tailored to the characteristics of LLMs efficiently addresses the majority of the challenges that current LAJ-MT models face, and demonstrates considerably stronger alignment with human judgments. Accordingly, we present Rubric-MQM, an LAJ-MT method for high-end models and an updated version of GEMBA-MQM.
Anthology ID:
2025.acl-industry.12
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Georg Rehm, Yunyao Li
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
147–165
URL:
https://preview.aclanthology.org/display_plenaries/2025.acl-industry.12/
Cite (ACL):
Ahrii Kim. 2025. RUBRIC-MQM : Span-Level LLM-as-judge in Machine Translation For High-End Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 147–165, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
RUBRIC-MQM : Span-Level LLM-as-judge in Machine Translation For High-End Models (Kim, ACL 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.acl-industry.12.pdf