Serge Gladkoff


2022

pdf
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation
Serge Gladkoff | Lifeng Han
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry setting by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform and go into great linguistic detail, raise issues as to inter-rater reliability (IRR) and are not designed to measure quality of worse than premium quality translations. In this work, we introduce HOPE, a task-oriented and h uman-centric evaluation framework for machine translation output based o n professional p ost- e diting annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit. The initial experimental work carried out on English-Russian language pair MT outputs on marketing content type of text from highly technical domain reveals that our evaluation framework is quite effective in reflecting the MT output quality regarding both overall system-level performance and segment-level transparency, and it increases the IRR for error type interpretation. The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE.

pdf
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Serge Gladkoff | Irina Sorokina | Lifeng Han | Alexandra Alekseeva
Proceedings of the Thirteenth Language Resources and Evaluation Conference

From both human translators (HT) and machine translation (MT) researchers’ point of view, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations which meet customer specifications with harsh constraints of required quality level in tight time-frames and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) are often chosen as the golden standard (CITATION). Human evaluations, however, are often accused of having low reliability and agreement. Is this caused by subjectivity or statistics is at play? How to avoid the entire text to be checked and be more efficient with TQE from cost and efficiency perspectives, and what is the optimal sample size of the translated text, so as to reliably estimate the translation quality of the entire material? This work carries out such a motivated research to correctly estimate the confidence intervals (CITATION) depending on the sample size of translated text, e.g. the amount of words or sentences, that needs to be processed on TQE workflow step for confident and reliable evaluation of overall translation quality. The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).

2021

pdf
cushLEPOR: customising hLEPOR metric using Optuna for higher agreement with human judgments or pre-trained language model LaBSE
Lifeng Han | Irina Sorokina | Gleb Erofeev | Serge Gladkoff
Proceedings of the Sixth Conference on Machine Translation

Human evaluation has always been expensive while researchers struggle to trust the automatic metrics. To address this, we propose to customise traditional metrics by taking advantages of the pre-trained language models (PLMs) and the limited available human labelled scores. We first re-introduce the hLEPOR metric factors, followed by the Python version we developed (ported) which achieved the automatic tuning of the weighting parameters in hLEPOR metric. Then we present the customised hLEPOR (cushLEPOR) which uses Optuna hyper-parameter optimisation framework to fine-tune hLEPOR weighting parameters towards better agreement to pre-trained language models (using LaBSE) regarding the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards professional human evaluation data based on MQM and pSQM framework on English-German and Chinese-English language pairs. The experimental investigations show cushLEPOR boosts hLEPOR performances towards better agreements to PLMs like LABSE with much lower cost, and better agreements to human evaluations including MQM and pSQM scores, and yields much better performances than BLEU. Official results show that our submissions win three language pairs including English-German and Chinese-English on News domain via cushLEPOR(LM) and English-Russian on TED domain via hLEPOR. (data available at https://github.com/poethan/cushLEPOR)


cushLEPOR uses LABSE distilled knowledge to improve correlation with human translation evaluations
Gleb Erofeev | Irina Sorokina | Lifeng Han | Serge Gladkoff
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Automatic MT evaluation metrics are indispensable for MT research. Augmented metrics such as hLEPOR include broader evaluation factors (recall and position difference penalty) in addition to the factors used in BLEU (sentence length, precision), and demonstrated higher accuracy. However, the obstacles preventing the wide use of hLEPOR were the lack of easy portable Python package and empirical weighting parameters that were tuned by manual work. This project addresses the above issues by offering a Python implementation of hLEPOR and automatic tuning of the parameters. We use existing translation memories (TM) as reference set and distillation modeling with LaBSE (Language-Agnostic BERT Sentence Embedding) to calibrate parameters for custom hLEPOR (cushLEPOR). cushLEPOR maximizes the correlation between hLEPOR and the distilling model similarity score towards reference. It can be used quickly and precisely to evaluate MT output from different engines, without need of manual weight tuning for optimization. In this session you will learn how to tune hLEPOR to obtain automatic custom-tuned cushLEPOR metric far more precise than BLEU. The method does not require costly human evaluations, existing TM is taken as a reference translation set, and cushLEPOR is created to select the best MT engine for the reference data-set.