Comparative Evaluation of Large Language Models for Linguistic Quality Assessment in Machine Translation

Daria Sinitsyna; Konstantin Savenkov

Abstract

Building on our earlier research on GPT-4 for linguistic quality assessment (LQA) in machine translation (MT), this study identifies the best-performing LLMs for an LQA pipeline that combines up to three models. GPT-4, GPT-4o, GPT-4 Turbo, Google Vertex, Anthropic’s Claude 3, and Llama-3 are prompted with the MQM error typology. Each model produces segment-level output describing translation errors, which are scored by severity using DQF-MQM penalties. The study evaluates four language pairs (English-Spanish, English-Chinese, English-German, and English-Portuguese) using datasets from our 2024 State of MT Report across eight domains. LLM outputs are correlated with human judgments, and the models are ranked by how closely they align with human assessments of penalty score, issue presence, issue type, and severity. Based on this ranking, the research proposes an LQA pipeline of up to three models weighted by output quality, highlighting the potential of LLMs to enhance MT review processes and improve translation quality.
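The scoring and comparison steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the severity weights (1/5/10 is a commonly used MQM-style default), the error record layout, the use of Pearson correlation, the weighted-average combination rule, and all function names are hypothetical.

```python
import math

# Assumed severity weights; the paper's actual DQF-MQM penalties may differ.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 10}


def segment_penalty(errors):
    """Sum the assumed severity weights of the errors reported for one segment."""
    return sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def weighted_penalty(penalties, weights):
    """Combine per-model penalties for one segment as a weighted average.

    This is one possible reading of 'weighted by output quality', not the
    authors' stated method.
    """
    return sum(p * w for p, w in zip(penalties, weights)) / sum(weights)


# Hypothetical model output for a single segment.
model_errors = [
    {"type": "accuracy/mistranslation", "severity": "major"},
    {"type": "fluency/punctuation", "severity": "minor"},
]
print(segment_penalty(model_errors))

# Hypothetical per-segment penalties from one model and from human reviewers.
print(pearson([0, 1, 6], [0, 2, 12]))
```

In a setup like this, each model's correlation with the human penalties could serve as its weight in `weighted_penalty`, which is again only one way to read the abstract.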

Anthology ID: 2024.amta-presentations.12
Volume: Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations)
Month: September
Year: 2024
Address: Chicago, USA
Editors: Marianna Martindale, Janice Campbell, Konstantin Savenkov, Shivali Goel
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 154–183
URL: https://preview.aclanthology.org/add_missing_videos/2024.amta-presentations.12/
Cite (ACL): Daria Sinitsyna and Konstantin Savenkov. 2024. Comparative Evaluation of Large Language Models for Linguistic Quality Assessment in Machine Translation. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations), pages 154–183, Chicago, USA. Association for Machine Translation in the Americas.
Cite (Informal): Comparative Evaluation of Large Language Models for Linguistic Quality Assessment in Machine Translation (Sinitsyna & Savenkov, AMTA 2024)
PDF: https://preview.aclanthology.org/add_missing_videos/2024.amta-presentations.12.pdf
