Estonian-Centric Machine Translation: Data, Models, and Challenges

Elizaveta Korotkova; Mark Fishel

Abstract

Machine translation (MT) research is typically English-centric. In recent years, massively multilingual translation systems have also become increasingly popular. However, efforts purposefully focused on less-resourced languages are less widespread. In this paper, we focus on MT from and into the Estonian language. First, emphasizing the importance of data availability, we generate and publicly release a back-translation corpus of over 2 billion sentence pairs. Second, using these novel data, we create MT models covering 18 translation directions, all either from or into Estonian. We re-use the encoder of the NLLB multilingual model and train modular decoders separately for each language, surpassing the original NLLB quality. Our resulting MT models largely outperform other open-source MT systems, including previous Estonian-focused efforts, and are released as part of this submission.

Anthology ID: 2024.eamt-1.55
Volume: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Month: June
Year: 2024
Address: Sheffield, UK
Editors: Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Cabarrão, Konstantinos Chatzitheodorou, Mary Nurminen, Diptesh Kanojia, Helena Moniz
Venue: EAMT
Publisher: European Association for Machine Translation (EAMT)
Pages: 647–660
URL: https://preview.aclanthology.org/fix-sig-urls/2024.eamt-1.55/
Cite (ACL): Elizaveta Korotkova and Mark Fishel. 2024. Estonian-Centric Machine Translation: Data, Models, and Challenges. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 647–660, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal): Estonian-Centric Machine Translation: Data, Models, and Challenges (Korotkova & Fishel, EAMT 2024)
PDF: https://preview.aclanthology.org/fix-sig-urls/2024.eamt-1.55.pdf