Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization

Kaden Uhlig, Joern Wuebker, Raphael Reinauer, John Denero


Abstract
Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task–data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.
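
The abstract describes DQO only at a high level. Below is a minimal, illustrative sketch of the general recipe it suggests: sample candidate translations, rank them with a pre-trained quality estimation model, and train on the resulting preference pairs with the standard DPO objective. The function names (qe_score, build_preference_pair, dpo_loss), the beta value, and the toy numbers are hypothetical stand-ins, not the authors' implementation.

import torch
import torch.nn.functional as F

def build_preference_pair(source, candidates, qe_score):
    # Score each sampled translation with the QE model (a proxy for human
    # preferences) and keep the best/worst candidates as chosen/rejected.
    scores = [qe_score(source, hyp) for hyp in candidates]
    chosen = candidates[max(range(len(scores)), key=scores.__getitem__)]
    rejected = candidates[min(range(len(scores)), key=scores.__getitem__)]
    return chosen, rejected

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective applied to the QE-derived preference pair:
    # rewards are policy-vs-reference log-probability margins.
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with dummy sequence log-probabilities (real values would come
# from the NMT model being aligned and a frozen reference copy of it).
loss = dpo_loss(
    logp_chosen=torch.tensor([-12.0]),
    logp_rejected=torch.tensor([-15.0]),
    ref_logp_chosen=torch.tensor([-13.0]),
    ref_logp_rejected=torch.tensor([-14.0]),
)
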
Anthology ID: 2025.wmt-1.2
Volume: Proceedings of the Tenth Conference on Machine Translation
Month: November
Year: 2025
Address: Suzhou, China
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 31–51
URL: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.2/
Cite (ACL): Kaden Uhlig, Joern Wuebker, Raphael Reinauer, and John Denero. 2025. Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization. In Proceedings of the Tenth Conference on Machine Translation, pages 31–51, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization (Uhlig et al., WMT 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.2.pdf