Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages
Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet
Abstract
For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that including UMT techniques into SSNMT significantly outperforms SSNMT (up to +4.3 BLEU, af2en) as well as statistical (+50.8 BLEU) and hybrid UMT (+51.5 BLEU) baselines on related, distantly-related and unrelated language pairs.
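The abstract names noising as one of the UMT data-generation techniques combined with SSNMT. Below is a minimal, illustrative Python sketch of word-level noising (token dropout plus a bounded local shuffle) in the style common to unsupervised MT pipelines; the function name `noise_sentence` and the parameters `p_drop` and `k_shuffle` are assumptions for this sketch, not the paper's implementation or settings.

```python
import random

def noise_sentence(tokens, p_drop=0.1, k_shuffle=3, seed=None):
    """Illustrative word-level noising (sketch, not the paper's code):
    drop each token with probability p_drop, then apply a local shuffle
    that moves tokens by at most roughly k_shuffle positions."""
    rng = random.Random(seed)
    # Word dropout: delete each token independently with probability p_drop.
    kept = [t for t in tokens if rng.random() > p_drop]
    if not kept:
        kept = tokens[:1]  # never return an empty sentence
    # Local shuffle: sort tokens by position plus uniform jitter in
    # [0, k_shuffle), so each token only drifts a bounded distance.
    keys = [i + rng.uniform(0, k_shuffle) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda x: x[0])]

src = "the cat sat on the mat".split()
print(noise_sentence(src, seed=0))  # e.g. a dropped token and reordered words
```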
- Anthology ID:
- 2021.mtsummit-research.7
- Volume:
- Proceedings of Machine Translation Summit XVIII: Research Track
- Month:
- August
- Year:
- 2021
- Address:
- Virtual
- Editors:
- Kevin Duh, Francisco Guzmán
- Venue:
- MTSummit
- Publisher:
- Association for Machine Translation in the Americas
- Pages:
- 76–91
- URL:
- https://aclanthology.org/2021.mtsummit-research.7
- Cite (ACL):
- Dana Ruiter, Dietrich Klakow, Josef van Genabith, and Cristina España-Bonet. 2021. Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 76–91, Virtual. Association for Machine Translation in the Americas.
- Cite (Informal):
- Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages (Ruiter et al., MTSummit 2021)
- PDF:
- https://aclanthology.org/2021.mtsummit-research.7.pdf