GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4

Tom Kocmi, Christian Federmann


Abstract
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting, which requires no human reference translations. Building on the power of large language models (LLMs), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark quality error spans. Unlike previous works, our method uses language-agnostic prompts, avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods, as it depends on the proprietary, black-box GPT model.
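The scheme described in the abstract can be sketched in Python. This is a minimal illustration, not the authors' exact prompt: the prompt wording, the parsed response format, and the MQM-style severity weights (minor=1, major=5, critical=25) are assumptions made for the example.

```python
# Hedged sketch of a GEMBA-MQM-style pipeline: build a language-agnostic
# quality-estimation prompt for a GPT model, then convert the model's marked
# error spans into a numeric score. Prompt text and weights are illustrative
# assumptions, not the paper's exact prompt.

def build_prompt(source_lang: str, target_lang: str,
                 source: str, translation: str) -> str:
    """Assemble a reference-free prompt asking the model to mark error spans."""
    return (
        f"{source_lang} source:\n```{source}```\n"
        f"{target_lang} translation:\n```{translation}```\n"
        "Based on the source segment and the translation, identify errors in "
        "the translation and classify each as critical, major, or minor.\n"
    )

# Assumed MQM-style severity weights (not confirmed against the paper).
WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def score_response(response: str) -> int:
    """Parse model output lines such as
    'major: accuracy/mistranslation - "span"' into a negative
    MQM-style segment score (0 = no errors found)."""
    penalty = 0
    for line in response.splitlines():
        severity = line.split(":", 1)[0].strip().lower()
        if severity in WEIGHTS:
            penalty += WEIGHTS[severity]
    return -penalty
```

For example, a hypothetical model response listing one major and one minor error would score `-6` under these assumed weights; in the real metric the model is queried with three fixed few-shot examples before the segment to be scored.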
Anthology ID:
2023.wmt-1.64
Volume:
Proceedings of the Eighth Conference on Machine Translation
Month:
December
Year:
2023
Address:
Singapore
Editors:
Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
768–775
URL:
https://aclanthology.org/2023.wmt-1.64
DOI:
10.18653/v1/2023.wmt-1.64
Cite (ACL):
Tom Kocmi and Christian Federmann. 2023. GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation, pages 768–775, Singapore. Association for Computational Linguistics.
Cite (Informal):
GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4 (Kocmi & Federmann, WMT 2023)
PDF:
https://aclanthology.org/2023.wmt-1.64.pdf