Integrating diverse corpora for training an endangered language machine translation system

Hunter Scheppat; Joshua Hartshorne; Dylan Leddy; Éric Le Ferrand; Emily Prud’hommeaux

Abstract

Machine translation (MT) can be a useful technology for language documentation and for promoting language use in endangered language communities. Few endangered languages, however, have an existing parallel corpus large enough to train a reasonable MT model. In this paper, we repurpose a wide range of diverse data sources containing Amis, English, and Mandarin text to serve as parallel corpora for training MT systems for Amis, one of the Indigenous languages of Taiwan. To supplement the small amount of Amis-English data, we produce synthetic Amis-English data by using a high-quality MT system to generate English translations for the Mandarin side of the Amis-Mandarin corpus. Using two popular neural MT systems, OpenNMT and NLLB, we train models to translate between English and Amis, and between Mandarin and Amis. We find that including synthetic data is helpful only when translating into English. In addition, we observe that neither MT architecture is consistently superior to the other and that performance seems to vary according to the direction of translation and the amount of data used. These results indicate that MT is possible for an under-resourced language even without a formally prepared parallel corpus, but multiple training methods should be explored to produce optimal results.

Anthology ID: 2025.computel-main.19
Volume: Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month: March
Year: 2025
Address: Honolulu, Hawaii, USA
Editors: Jordan Lachler, Godfred Agyapong, Antti Arppe, Sarah Moeller, Aditi Chaudhary, Shruti Rijhwani, Daisy Rosenblum
Venues: ComputEL | WS
Publisher: Association for Computational Linguistics
Pages: 162–169
URL: https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.19/
Cite (ACL): Hunter Scheppat, Joshua Hartshorne, Dylan Leddy, Éric Le Ferrand, and Emily Prud’hommeaux. 2025. Integrating diverse corpora for training an endangered language machine translation system. In Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 162–169, Honolulu, Hawaii, USA. Association for Computational Linguistics.
Cite (Informal): Integrating diverse corpora for training an endangered language machine translation system (Scheppat et al., ComputEL 2025)
PDF: https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.19.pdf