2022
High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics
Markus Freitag | David Grangier | Qijun Tan | Bowen Liang
Transactions of the Association for Computational Linguistics, Volume 10
In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability should also be the translation with the highest quality as measured by humans. In this work, we question this assumption and show that model estimates and translation quality only vaguely correlate. We apply Minimum Bayes Risk (MBR) decoding on unbiased samples to optimize diverse automated metrics of translation quality as an alternative inference strategy to beam search. Instead of targeting the hypotheses with the highest model probability, MBR decoding extracts the hypotheses with the highest estimated quality. Our experiments show that the combination of a neural translation model with a neural reference-based metric, Bleurt, results in significant improvement in human evaluations. This improvement is obtained with translations different from classical beam-search output: These translations have much lower model likelihood and are less favored by surface metrics like Bleu.
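The core of MBR decoding with a learned utility can be written in a few lines. The sketch below assumes `candidates` holds unbiased samples from the translation model and `utility` is a sentence-level reference-based quality metric (for instance a BLEURT-style scorer); the function name and signature are illustrative, not the paper's implementation.

```python
def mbr_decode(candidates, utility):
    """Minimum Bayes Risk decoding sketch: pick the sample with the highest
    expected utility, using all samples as pseudo-references.

    candidates: list of translation strings sampled from the NMT model
    utility:    callable (hypothesis, reference) -> quality score
    """
    def expected_utility(hyp):
        # Average utility of `hyp` against every sample acting as a pseudo-reference.
        return sum(utility(hyp, ref) for ref in candidates) / len(candidates)

    # The MBR output maximizes estimated quality, not model probability.
    return max(candidates, key=expected_utility)
```

In contrast to beam search, nothing here consults the model's own probabilities at selection time; the quality estimate comes entirely from the metric applied over the sample set.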
Toward More Effective Human Evaluation for Machine Translation
Belén Saldías Fuentes | George Foster | Markus Freitag | Qijun Tan
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve gains of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.
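As a rough illustration of the control-variate idea, the sketch below estimates the full test-set mean of a human score from an annotated subsample, using an automatic metric (available for every segment) as the control variate. The names and the estimator form follow the standard control-variates construction and are not necessarily the exact procedure used in the paper.

```python
import numpy as np

def cv_estimate(human_sub, metric_sub, metric_all):
    """Estimate the mean human score on the full test set from an annotated
    subsample, using an automatic metric as a control variate.

    human_sub:  human scores on the annotated subsample
    metric_sub: automatic-metric scores on the same subsample
    metric_all: automatic-metric scores on the full test set
    """
    human_sub = np.asarray(human_sub, dtype=float)
    metric_sub = np.asarray(metric_sub, dtype=float)
    metric_all = np.asarray(metric_all, dtype=float)

    # Optimal coefficient: Cov(human, metric) / Var(metric), estimated on the subsample.
    beta = np.cov(human_sub, metric_sub)[0, 1] / np.var(metric_sub, ddof=1)

    # Shift the subsample mean by how far the metric's subsample mean
    # deviates from its known full test-set mean.
    return human_sub.mean() - beta * (metric_sub.mean() - metric_all.mean())
```

The variance reduction, and hence the gain in average absolute error, grows with the correlation between the automatic metric and the human judgments.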
2021
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
Markus Freitag | George Foster | David Grangier | Viresh Ratnakar | Qijun Tan | Wolfgang Macherey
Transactions of the Association for Computational Linguistics, Volume 9
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
2020
Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
Thibault Sellam | Amy Pu | Hyung Won Chung | Sebastian Gehrmann | Qijun Tan | Markus Freitag | Dipanjan Das | Ankur Parikh
Proceedings of the Fifth Conference on Machine Translation
The quality of machine translation systems has dramatically improved over the last decade, and as a result, evaluation has become an increasingly challenging problem. This paper describes our contribution to the WMT 2020 Metrics Shared Task, the main benchmark for automatic evaluation of translation. We make several submissions based on BLEURT, a previously published metric which uses transfer learning. We extend the metric beyond English and evaluate it on 14 language pairs for which fine-tuning data is available, as well as 4 “zero-shot” language pairs, for which we have no labelled examples. Additionally, we focus on English to German and demonstrate how to combine BLEURT’s predictions with those of YiSi and use alternative reference translations to enhance the performance. Empirical results show that the models achieve competitive results on the WMT Metrics 2019 Shared Task, indicating their promise for the 2020 edition.
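The abstract does not spell out the scheme used to combine BLEURT’s predictions with YiSi’s, so the sketch below only illustrates one common assumption: z-normalise each metric’s segment-level scores over the test set and take a weighted average. Function and parameter names are hypothetical.

```python
import numpy as np

def combine_metric_scores(score_lists, weights=None):
    """Combine per-segment scores from several metrics (e.g. BLEURT and YiSi)
    by z-normalising each metric over the test set and averaging.

    score_lists: list of per-segment score arrays, one per metric
    weights:     optional per-metric weights (defaults to uniform)
    """
    arrays = [np.asarray(s, dtype=float) for s in score_lists]
    if weights is None:
        weights = [1.0 / len(arrays)] * len(arrays)
    # Put all metrics on a comparable scale before mixing them.
    normalised = [(a - a.mean()) / (a.std() + 1e-8) for a in arrays]
    return sum(w * z for w, z in zip(weights, normalised))
```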