@inproceedings{tars-etal-2022-teaching,
    title = "Teaching Unseen Low-resource Languages to Large Translation Models",
    author = {Tars, Maali  and
      Purason, Taido  and
      T{\"a}ttar, Andre},
    editor = {Koehn, Philipp  and
      Barrault, Lo{\"i}c  and
      Bojar, Ond{\v{r}}ej  and
      Bougares, Fethi  and
      Chatterjee, Rajen  and
      Costa-juss{\`a}, Marta R.  and
      Federmann, Christian  and
      Fishel, Mark  and
      Fraser, Alexander  and
      Freitag, Markus  and
      Graham, Yvette  and
      Grundkiewicz, Roman  and
      Guzman, Paco  and
      Haddow, Barry  and
      Huck, Matthias  and
      Jimeno Yepes, Antonio  and
      Kocmi, Tom  and
      Martins, Andr{\'e}  and
      Morishita, Makoto  and
      Monz, Christof  and
      Nagata, Masaaki  and
      Nakazawa, Toshiaki  and
      Negri, Matteo  and
      N{\'e}v{\'e}ol, Aur{\'e}lie  and
      Neves, Mariana  and
      Popel, Martin  and
      Turchi, Marco  and
      Zampieri, Marcos},
    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wmt-1.33/",
    pages = "375--380",
    abstract = "In recent years, research on large multilingual pre-trained neural machine translation models has grown, and it is common for these models to be publicly available for use and fine-tuning. Low-resource languages benefit from the pre-trained models because of knowledge transfer from high- to medium-resource languages. The recently available M2M-100 model is our starting point for cross-lingual transfer learning to Finno-Ugric languages such as Livonian. We participate in the WMT22 General Machine Translation task, focusing on the English-Livonian language pair. We leverage data from other Finno-Ugric languages and thereby achieve high scores for the English-Livonian translation directions. Overall, instead of training a model from scratch, we use transfer learning and back-translation as our main methods and fine-tune a publicly available pre-trained model. This in turn reduces the cost and duration of training high-quality multilingual neural machine translation models."
}