2025
pdf
bib
abs
FLORES+ Mayas: Generating Textual Resources to Foster the Development of Language Technologies for Mayan Languages
Andrés Lou
|
Juan Antonio Pérez-Ortiz
|
Felipe Sánchez-Martínez
|
Miquel Esplà-Gomis
|
Víctor M. Sánchez-Cartagena
Proceedings of Machine Translation Summit XX: Volume 2
A significant percentage of the population of Guatemala and Mexico belongs to various Mayan indigenous communities, for whom language barriers lead to social, economic, and digital exclusion. The Mayan languages spoken by these communities remain severely underrepresented in terms of digital resources, which prevents them from leveraging the latest advances in artificial intelligence. This project addresses that problem by means of: 1) the digitisation and release of multiple printed linguistic resources; 2) the development of a high-quality parallel machine translation (MT) evaluation corpus for six Mayan languages. In doing so, we are paving the way for the development of MT systems that will facilitate the access for Mayan speakers to essential services such as healthcare or legal aid. The resources are produced with the essential participation of indigenous communities, whereby native speakers provide the necessary translation services, QA, and linguistic expertise. The project is funded by the Google Academic Research Awards and carried out in collaboration with the Proyecto Lingüístico Francisco Marroquín Foundation in Guatemala.
2024
pdf
bib
abs
Lightweight neural translation technologies for low-resource languages
Felipe Sánchez-Martínez
|
Juan Antonio Pérez-Ortiz
|
Víctor Sánchez-Cartagena
|
Andrés Lou
|
Cristian García-Romero
|
Aarón Galiano-Jiménez
|
Miquel Esplà-Gomis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The LiLowLa (“Lightweight neural translation technologies for low-resource languages”) project aims to enhance machine translation (MT) and translation memory (TM) technologies, particularly for low-resource language pairs, where adequate linguistic resources are scarce. The project started in September 2022 and will run till August 2025.
pdf
bib
abs
Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars
Andrés Lou
|
Juan Antonio Pérez-Ortiz
|
Felipe Sánchez-Martínez
|
Víctor Sánchez-Cartagena
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value, that, nevertheless, remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish, and that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at https://github.com/transducens/mayanv.