Teaching Unseen Low-resource Languages to Large Translation Models

Maali Tars, Taido Purason, Andre Tättar


Abstract
In recent years, research on large multilingual pre-trained neural machine translation models has grown, and it is common for these models to be publicly available for use and fine-tuning. Low-resource languages benefit from such pre-trained models through knowledge transfer from high- and medium-resource languages. The recently released M2M-100 model is our starting point for cross-lingual transfer learning to Finno-Ugric languages, such as Livonian. We participate in the WMT22 General Machine Translation task, where we focus on the English-Livonian language pair. We leverage data from other Finno-Ugric languages and thereby achieve high scores for both translation directions of the English-Livonian pair. Overall, instead of training a model from scratch, we fine-tune a publicly available pre-trained model, using transfer learning and back-translation as our main methods. This in turn reduces the cost and duration of training high-quality multilingual neural machine translation models.
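The paper itself does not ship code, but the fine-tuning setup it describes can be sketched with the Hugging Face transformers interface to M2M-100. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: since Livonian is not among M2M-100's supported languages, it borrows the Estonian code "et" as a stand-in target tag, and the checkpoint name, placeholder sentence pair, and hyperparameters are likewise assumptions made here for illustration.

```python
# Minimal sketch: one fine-tuning step of M2M-100 on an English->Livonian pair.
# Assumptions: the facebook/m2m100_418M checkpoint and the Estonian tag "et"
# as a stand-in for Livonian, which M2M-100 does not cover out of the box.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained(
    "facebook/m2m100_418M", src_lang="en", tgt_lang="et"  # "et" stands in for Livonian
)

# One toy parallel example; the target string is a placeholder, not real Livonian.
src_text = "Good morning!"
tgt_text = "Livonian reference translation goes here"

batch = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

# Single gradient step; a real run would iterate over the parallel and
# back-translated corpora for many updates.
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```

In the same spirit, back-translated monolingual Livonian sentences would simply be added to the fine-tuning batches as extra parallel examples; the exact data mixture and training schedule are described in the paper.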
Anthology ID:
2022.wmt-1.33
Volume:
Proceedings of the Seventh Conference on Machine Translation (WMT)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
375–380
URL:
https://aclanthology.org/2022.wmt-1.33
Cite (ACL):
Maali Tars, Taido Purason, and Andre Tättar. 2022. Teaching Unseen Low-resource Languages to Large Translation Models. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 375–380, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Teaching Unseen Low-resource Languages to Large Translation Models (Tars et al., WMT 2022)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2022.wmt-1.33.pdf