2023
pdf
abs
Boosting Unsupervised Machine Translation with Pseudo-Parallel Data
Ivana Kvapilíková
|
Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any translation resources but the quality lags behind, especially in truly low-resource conditions. We propose a training strategy that relies on pseudo-parallel sentence pairs mined from monolingual corpora in addition to synthetic sentence pairs back-translated from monolingual corpora. We experiment with different training schedules and reach an improvement of up to 14.5 BLEU points (English to Ukrainian) over a baseline trained on back-translated data only.
pdf
abs
Low-Resource Machine Translation Systems for Indic Languages
Ivana Kvapilíková
|
Ondřej Bojar
Proceedings of the Eighth Conference on Machine Translation
We present our submission to the WMT23 shared task in translation between English and Assamese, Khasi, Mizo and Manipuri. All our systems were pretrained on the task of multilingual masked language modelling and denoising auto-encoding. Our primary systems for translation into English were further pretrained for multilingual MT in all four language directions and fine-tuned on the limited parallel data available for each language pair separately. We used online back-translation for data augmentation. The same systems were submitted as contrastive for translation out of English as the multilingual MT pretraining step seemed to harm the translation performance. Our primary systems for translation out of English were trained without the multilingual MT pretraining step. Other contrastive systems used additional pseudo-parallel data mined from monolingual corpora for pretraining.
2022
pdf
abs
CUNI Submission to MT4All Shared Task
Ivana Kvapilíková
|
Ondrej Bojar
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
This paper describes our submission to the MT4All Shared Task in unsupervised machine translation from English to Ukrainian, Kazakh and Georgian in the legal domain. In addition to the standard pipeline for unsupervised training (pretraining followed by denoising and back-translation), we used supervised training on a pseudo-parallel corpus retrieved from the provided mono-lingual corpora. Our system scored significantly higher than the baseline hybrid unsupervised MT system.
pdf
abs
CUNI Submission to the BUCC 2022 Shared Task on Bilingual Term Alignment
Borek Požár
|
Klára Tauchmanová
|
Kristýna Neumannová
|
Ivana Kvapilíková
|
Ondřej Bojar
Proceedings of the BUCC Workshop within LREC 2022
We present our submission to the BUCC Shared Task on bilingual term alignment in comparable specialized corpora. We devised three approaches using static embeddings with post-hoc alignment, the Monoses pipeline for unsupervised phrase-based machine translation, and contextualized multilingual embeddings. We show that contextualized embeddings from pretrained multilingual models lead to similar results as static embeddings but further improvement can be achieved by task-specific fine-tuning. Retrieving term pairs from the running phrase tables of the Monoses systems can match this enhanced performance and leads to an average precision of 0.88 on the train set.
2021
pdf
abs
Findings of the WMT Shared Task on Machine Translation Using Terminologies
Md Mahfuz Ibn Alam
|
Ivana Kvapilíková
|
Antonios Anastasopoulos
|
Laurent Besacier
|
Georgiana Dinu
|
Marcello Federico
|
Matthias Gallé
|
Kweonwoo Jung
|
Philipp Koehn
|
Vassilina Nikoulina
Proceedings of the Sixth Conference on Machine Translation
Language domains that require very careful use of terminology are abundant and reflect a significant part of the translation industry. In this work we introduce a benchmark for evaluating the quality and consistency of terminology translation, focusing on the medical (and COVID-19 specifically) domain for five language pairs: English to French, Chinese, Russian, and Korean, as well as Czech to German. We report the descriptions and results of the participating systems, commenting on the need for further research efforts towards both more adequate handling of terminologies as well as towards a proper formulation and evaluation of the task.
2020
pdf
abs
Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
Ivana Kvapilíková
|
Mikel Artetxe
|
Gorka Labaka
|
Eneko Agirre
|
Ondřej Bojar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.
pdf
abs
CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20
Ivana Kvapilíková
|
Tom Kocmi
|
Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation
This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian. We experimented with training on synthetic data and pre-training on a related language pair. In the fully unsupervised scenario, we achieved 25.5 and 23.7 BLEU translating from and into Upper Sorbian, respectively. Our low-resource systems relied on transfer learning from German-Czech parallel data and achieved 57.4 BLEU and 56.1 BLEU, which is an improvement of 10 BLEU points over the baseline trained only on the available small German-Upper Sorbian parallel corpus.
2019
pdf
abs
CUNI Systems for the Unsupervised News Translation Task in WMT 2019
Ivana Kvapilíková
|
Dominik Macháček
|
Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
In this paper we describe the CUNI translation system used for the unsupervised news shared task of the ACL 2019 Fourth Conference on Machine Translation (WMT19). We follow the strategy of Artetxe ae at. (2018b), creating a seed phrase-based system where the phrase table is initialized from cross-lingual embedding mappings trained on monolingual data, followed by a neural machine translation system trained on synthetic parallel data. The synthetic corpus was produced from a monolingual corpus by a tuned PBMT model refined through iterative back-translation. We further focus on the handling of named entities, i.e. the part of vocabulary where the cross-lingual embedding mapping suffers most. Our system reaches a BLEU score of 15.3 on the German-Czech WMT19 shared task.