Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet


Abstract
For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that including UMT techniques into SSNMT significantly outperforms SSNMT (up to +4.3 BLEU, af2en) as well as statistical (+50.8 BLEU) and hybrid UMT (+51.5 BLEU) baselines on related, distantly-related and unrelated language pairs.
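To illustrate the "noising" the abstract refers to, below is a minimal Python sketch of generic word-level noising as commonly used for UMT-style denoising training (random word drops plus a light local shuffle). This is not the paper's exact procedure; the function name and the p_drop and k_shuffle hyperparameters are illustrative assumptions.

import random

def noise(sentence, p_drop=0.1, k_shuffle=3, seed=None):
    """Word-level noising sketch: drop words at random, then lightly
    shuffle the rest. The (noised, clean) pair can serve as synthetic
    training data for a denoising MT objective."""
    rng = random.Random(seed)
    words = sentence.split()
    # Drop each word independently with probability p_drop;
    # fall back to the original words if everything was dropped.
    kept = [w for w in words if rng.random() > p_drop] or words
    # Local shuffle: each word may move at most ~k_shuffle positions.
    keys = [i + rng.uniform(0, k_shuffle) for i in range(len(kept))]
    return " ".join(w for _, w in sorted(zip(keys, kept), key=lambda t: t[0]))

# Example: pair the noised copy with the clean sentence as synthetic data.
print(noise("self supervised neural machine translation", seed=0))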
Anthology ID:
2021.mtsummit-research.7
Volume:
Proceedings of Machine Translation Summit XVIII: Research Track
Month:
August
Year:
2021
Address:
Virtual
Editors:
Kevin Duh, Francisco Guzmán
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
76–91
URL:
https://aclanthology.org/2021.mtsummit-research.7
Cite (ACL):
Dana Ruiter, Dietrich Klakow, Josef van Genabith, and Cristina España-Bonet. 2021. Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 76–91, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages (Ruiter et al., MTSummit 2021)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2021.mtsummit-research.7.pdf