Teaching Unseen Low-resource Languages to Large Translation Models

Maali Tars, Taido Purason, Andre Tättar


Abstract
In recent years, research on large multilingual pre-trained neural machine translation models has grown, and it is common for these models to be publicly available for use and fine-tuning. Low-resource languages benefit from such pre-trained models through knowledge transfer from high- and medium-resource languages. The recently released M2M-100 model is our starting point for cross-lingual transfer learning to Finno-Ugric languages, such as Livonian. We participate in the WMT22 General Machine Translation task, where we focus on the English-Livonian language pair. We leverage data from other Finno-Ugric languages and thereby achieve high scores for the English-Livonian translation directions. Overall, instead of training a model from scratch, we use transfer learning and back-translation as our main methods and fine-tune a publicly available pre-trained model. This in turn reduces the cost and duration of training high-quality multilingual neural machine translation models.
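The core recipe described in the abstract is fine-tuning the public M2M-100 checkpoint on parallel data rather than training a new model. Below is a minimal sketch of that general approach using the Hugging Face transformers implementation of M2M-100; the checkpoint name, toy sentence pair, Estonian ("et") target tag, and hyperparameters are illustrative assumptions and not the authors' actual setup, which additionally handles Livonian, a language absent from the stock M2M-100 vocabulary.

```python
# Minimal fine-tuning sketch for a public M2M-100 checkpoint (assumed: the
# 418M-parameter model on the Hugging Face Hub). Not the paper's exact setup.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Toy English->Estonian pair; Estonian stands in for Livonian here because
# Livonian has no language tag in the stock M2M-100 vocabulary.
pairs = [("The weather is nice today.", "Ilm on täna ilus.")]

tokenizer.src_lang = "en"
tokenizer.tgt_lang = "et"

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for src, tgt in pairs:
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss  # cross-entropy over the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Translate with the fine-tuned model, forcing the target-language tag.
model.eval()
generated = model.generate(
    **tokenizer("Good morning!", return_tensors="pt"),
    forced_bos_token_id=tokenizer.get_lang_id("et"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

In the same spirit, back-translated data (synthetic source sentences produced by a reverse-direction model) could simply be appended to the training pairs before fine-tuning; the paper's concrete data mixture and training schedule are described in the full text.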
Anthology ID:
2022.wmt-1.33
Volume:
Proceedings of the Seventh Conference on Machine Translation (WMT)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
375–380
URL:
https://aclanthology.org/2022.wmt-1.33
Cite (ACL):
Maali Tars, Taido Purason, and Andre Tättar. 2022. Teaching Unseen Low-resource Languages to Large Translation Models. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 375–380, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Teaching Unseen Low-resource Languages to Large Translation Models (Tars et al., WMT 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.wmt-1.33.pdf