Machine Translation for Low-resource Finno-Ugric Languages

Lisa Yankovskaya, Maali Tars, Andre Tättar, Mark Fishel


Abstract
This paper focuses on neural machine translation (NMT) for low-resource Finno-Ugric languages. Our contributions are three-fold: (1) we extend existing and collect new parallel and monolingual corpora for 20 languages, (2) we expand the 200-language translation benchmark FLORES-200 with manual translations into nine new languages, and (3) we present experiments using the collected data to create NMT systems for the included languages and investigate the impact of back-translation data on NMT performance for low-resource languages. Experimental results show that a carefully selected, limited set of back-translation directions yields the best translation scores for both high-resource and low-resource output languages.
Anthology ID:
2023.nodalida-1.77
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
762–771
URL:
https://aclanthology.org/2023.nodalida-1.77
Cite (ACL):
Lisa Yankovskaya, Maali Tars, Andre Tättar, and Mark Fishel. 2023. Machine Translation for Low-resource Finno-Ugric Languages. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 762–771, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Machine Translation for Low-resource Finno-Ugric Languages (Yankovskaya et al., NoDaLiDa 2023)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2023.nodalida-1.77.pdf