Anna Zaretskaya


2021

Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis
Martha Maria Papadopoulou | Anna Zaretskaya | Ruslan Mitkov
Proceedings of the Translation and Interpreting Technology Online Conference

This paper offers a comparative evaluation of four commercial ASR systems which are evaluated according to the post-editing effort required to reach “publishable” quality and according to the number of errors they produce. For the error annotation task, an original error typology for transcription errors is proposed. This study also seeks to examine whether there is a difference in the performance of these systems between native and non-native English speakers. The experimental results suggest that among the four systems, Trint obtains the best scores. It is also observed that most systems perform noticeably better with native speakers and that all systems are most prone to fluency errors.

System Description for Transperfect
Wiktor Stribiżew | Fred Bane | José Conceição | Anna Zaretskaya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

In this paper, we describe our participation in the 2021 Workshop on Asian Translation (team ID: tpt_wat). We submitted results for all six directions of the JPC2 patent task. As a first-time participant in the task, we attempted to identify a single configuration that provided the best overall results across all language pairs. All our submissions were created using single base transformer models, trained on only the task-specific data, using a consistent configuration of hyperparameters. In contrast to the uniformity of our methods, our results vary widely across the six language pairs.

Selecting the best data filtering method for NMT training
Fred Bane | Anna Zaretskaya
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Performance of NMT systems has been proven to depend on the quality of the training data. In this paper we explore different open-source tools that can be used to score the quality of translation pairs, with the goal of obtaining clean corpora for training NMT models. We measure the performance of these tools by correlating their scores with human scores, as well as rank models trained on the resulting filtered datasets in terms of their performance on different test sets and MT performance metrics.

2020

Estimation vs Metrics: is QE Useful for MT Model Selection?
Anna Zaretskaya | José Conceição | Frederick Bane
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents a case study of applying machine translation quality estimation (QE) for the purpose of machine translation (MT) engine selection. The goal is to understand how well the QE predictions correlate with several MT evaluation metrics (automatic and human). Our findings show that our industry-level QE system is not reliable enough for MT selection when the MT systems have similar performance. We suggest that QE can be used with more success for other tasks relevant for the translation industry such as risk prevention.

QE Viewer: an Open-Source Tool for Visualization of Machine Translation Quality Estimation Results
Felipe Soares | Anna Zaretskaya | Diego Bartolome
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

QE Viewer is a web-based tool for visualizing results of a Machine Translation Quality Estimation (QE) system. It allows users to see information on the predicted post-editing distance (PED) for a given file or sentence, and highlighted words that were predicted to contain MT errors. The tool can be used in a variety of academic, educational and commercial scenarios.

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts
Felipe Soares | Mark Stevenson | Diego Bartolome | Anna Zaretskaya
Proceedings of the 12th Language Resources and Evaluation Conference

Google Patents is one of the most important sources of patent information. A striking characteristic is that many of its abstracts are presented in more than one language, thus making it a potential source of parallel corpora. This article presents the development of a parallel corpus from the open access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned. We demonstrate the capabilities of our corpus by training Neural Machine Translation (NMT) models for the main 9 language pairs, with a total of 18 models. Our parallel corpus is freely available in TSV format and with a SQLite database, with complementary information regarding patent metadata.

2019

Raising the TM Threshold in Neural MT Post-Editing: a Case Study on Two Datasets
Anna Zaretskaya
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Optimising the Machine Translation Post-editing Workflow
Anna Zaretskaya
Proceedings of the Human-Informed Translation and Interpreting Technology Workshop (HiT-IT 2019)

In this article, we describe how machine translation is used for post-editing at TransPerfect and the ways in which we optimise the workflow. This includes MT evaluation, MT engine customisation, leveraging MT suggestions compared to TM matches, and the lessons learnt from implementing MT at a large scale.