Advancing NMT for Indigenous Languages: A Case Study on Yucatec Mayan and Chol

Julio Rangel, Norio Kobayashi


Abstract
This study leverages Spanish-trained large language models (LLMs) to develop neural machine translation (NMT) systems for Mayan languages. We first compile and process a low-resource dataset of 28,135 translation pairs for Chol and Yucatec Mayan, extracted from documents in the CPLM Corpus (Martínez et al.). We then implement a prompt-based approach to train one-to-many and many-to-many models. Comparing several training strategies across two LLMs, we find that multilingual training performs better on average, with ChrF++ reaching 50 on the test set in the best case. This study reinforces the viability of using LLMs to improve accessibility and preservation for languages with limited digital resources. We share our code, datasets, and models to promote collaboration and progress in this field: https://github.com/RIKEN-DKO/iikim_translator.
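To make the prompt-based, many-to-many formulation and the reported metric concrete, here is a minimal sketch. Everything in it (the prompt template, language names, and example strings) is an illustrative assumption, not the authors' actual implementation, which lives in the repository linked above; only the ChrF++ computation via sacrebleu follows the standard library API.

```python
# Minimal sketch of a prompt-based many-to-many NMT setup and ChrF++ scoring.
# Template wording, language tags, and example sentences are assumptions;
# see the authors' repository for the real training code.
from sacrebleu.metrics import CHRF


def make_prompt(src_text: str, src_lang: str, tgt_lang: str) -> str:
    """Tag each example with its source and target language so a single
    model can be trained on all translation directions (many-to-many)."""
    return f"Translate from {src_lang} to {tgt_lang}: {src_text}\nTranslation:"


# Toy pair; the real training data is the 28,135-pair dataset compiled
# from the CPLM Corpus.
print(make_prompt("Buenos días", "Spanish", "Yucatec Mayan"))

# ChrF++ (the paper's reported metric) is chrF with word bigrams enabled,
# i.e. word_order=2 in sacrebleu.
chrf_pp = CHRF(word_order=2)
hypotheses = ["ma'alob k'iin"]    # model outputs (placeholder strings)
references = [["ma'alob k'iin"]]  # one reference stream, aligned to hypotheses
print(chrf_pp.corpus_score(hypotheses, references))
```

Tagging the target language inside the prompt is what allows one model to cover both the one-to-many and many-to-many settings compared in the paper.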
Anthology ID:
2024.americasnlp-1.16
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
138–142
URL:
https://aclanthology.org/2024.americasnlp-1.16
Cite (ACL):
Julio Rangel and Norio Kobayashi. 2024. Advancing NMT for Indigenous Languages: A Case Study on Yucatec Mayan and Chol. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 138–142, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Advancing NMT for Indigenous Languages: A Case Study on Yucatec Mayan and Chol (Rangel & Kobayashi, AmericasNLP-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.americasnlp-1.16.pdf