Anna Zaretskaya


A Comparison of Data Filtering Methods for Neural Machine Translation
Fred Bane | Celia Soler Uguet | Wiktor Stribiżew | Anna Zaretskaya
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)

With the increasing availability of large-scale parallel corpora derived from web crawling and bilingual text mining, data filtering is becoming an increasingly important step in neural machine translation (NMT) pipelines. This paper applies several available tools to the task of data filtration, and compares their performance in filtering out different types of noisy data. We also study the effect of filtration with each tool on model performance in the downstream task of NMT by creating a dataset containing a combination of clean and noisy data, filtering the data with each tool, and training NMT engines using the resulting filtered corpora. We evaluate the performance of each engine with a combination of direct assessment (DA) and automated metrics. Our best results are obtained by training for a short time on all available data, then filtering the corpus with cross-entropy filtering and training until convergence.

Comparing Multilingual NMT Models and Pivoting
Celia Soler Uguet | Fred Bane | Anna Zaretskaya | Tània Blanch Miró
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Following recent advancements in multilingual machine translation at scale, our team carried out tests to compare the performance of multilingual models (M2M from Facebook and multilingual models from Helsinki-NLP) with a two-step translation process using English as a pivot language. Direct assessment by linguists rated translations produced by pivoting as consistently better than those obtained from multilingual models of similar size, while automated evaluation with COMET suggested relative performance was strongly impacted by domain and language family.


Selecting the best data filtering method for NMT training
Fred Bane | Anna Zaretskaya
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Performance of NMT systems has been proven to depend on the quality of the training data. In this paper we explore different open-source tools that can be used to score the quality of translation pairs, with the goal of obtaining clean corpora for training NMT models. We measure the performance of these tools by correlating their scores with human scores, and we rank models trained on the resulting filtered datasets by their performance on different test sets and MT performance metrics.

Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis
Martha Maria Papadopoulou | Anna Zaretskaya | Ruslan Mitkov
Proceedings of the Translation and Interpreting Technology Online Conference

This paper offers a comparative evaluation of four commercial ASR systems which are evaluated according to the post-editing effort required to reach “publishable” quality and according to the number of errors they produce. For the error annotation task, an original error typology for transcription errors is proposed. This study also seeks to examine whether there is a difference in the performance of these systems between native and non-native English speakers. The experimental results suggest that among the four systems, Trint obtains the best scores. It is also observed that most systems perform noticeably better with native speakers and that all systems are most prone to fluency errors.

System Description for Transperfect
Wiktor Stribiżew | Fred Bane | José Conceição | Anna Zaretskaya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

In this paper, we describe our participation in the 2021 Workshop on Asian Translation (team ID: tpt_wat). We submitted results for all six directions of the JPC2 patent task. As a first-time participant in the task, we attempted to identify a single configuration that provided the best overall results across all language pairs. All our submissions were created using single base transformer models, trained on only the task-specific data, using a consistent configuration of hyperparameters. In contrast to the uniformity of our methods, our results vary widely across the six language pairs.


Estimation vs Metrics: is QE Useful for MT Model Selection?
Anna Zaretskaya | José Conceição | Frederick Bane
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents a case study of applying machine translation quality estimation (QE) for the purpose of machine translation (MT) engine selection. The goal is to understand how well the QE predictions correlate with several MT evaluation metrics (automatic and human). Our findings show that our industry-level QE system is not reliable enough for MT selection when the MT systems have similar performance. We suggest that QE can be used with more success for other tasks relevant to the translation industry, such as risk prevention.

QE Viewer: an Open-Source Tool for Visualization of Machine Translation Quality Estimation Results
Felipe Soares | Anna Zaretskaya | Diego Bartolome
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

QE Viewer is a web-based tool for visualizing results of a Machine Translation Quality Estimation (QE) system. It allows users to see information on the predicted post-editing distance (PED) for a given file or sentence, and highlighted words that were predicted to contain MT errors. The tool can be used in a variety of academic, educational and commercial scenarios.

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts
Felipe Soares | Mark Stevenson | Diego Bartolome | Anna Zaretskaya
Proceedings of the Twelfth Language Resources and Evaluation Conference

Google Patents is one of the most important sources of patent information. A striking characteristic is that many of its abstracts are presented in more than one language, thus making it a potential source of parallel corpora. This article presents the development of a parallel corpus from the open access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the 22 largest language pairs, while the others were aligned at the abstract (i.e. paragraph) level. We demonstrate the capabilities of our corpus by training Neural Machine Translation (NMT) models for the 9 main language pairs, with a total of 18 models. Our parallel corpus is freely available in TSV format and with a SQLite database, with complementary information regarding patent metadata.


Raising the TM Threshold in Neural MT Post-Editing: a Case Study on Two Datasets
Anna Zaretskaya
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Optimising the Machine Translation Post-editing Workflow
Anna Zaretskaya
Proceedings of the Human-Informed Translation and Interpreting Technology Workshop (HiT-IT 2019)

In this article, we describe how machine translation is used for post-editing at TransPerfect and the ways in which we optimise the workflow. This includes MT evaluation, MT engine customisation, leveraging MT suggestions compared to TM matches, and the lessons learnt from implementing MT at a large scale.