From Field Linguistics to NLP: Creating a curated dataset in Amuzgo language

Antonio Reyes, Hamlet Antonio García


Abstract
This article presents ongoing research on one of the native languages of the Americas: Amuzgo, or jny’on3 nda3. This language is spoken in Southern Mexico and belongs to the Otomanguean family. Although the vitality of Amuzgo is stable and some resources are available, such as grammars, dictionaries, and literature, its digital inclusion is still emerging (cf. Eberhard et al., 2024). In this respect, this article describes the creation of a curated dataset in Amuzgo. This resource is intended to contribute to the development of tools for low-resource languages by providing fine-grained linguistic information in different layers, from data collection with native speakers to data annotation. The dataset was built according to the following method: i) data collection in Amuzgo by means of linguistic fieldwork; ii) acoustic data processing; iii) data transcription; iv) glossing and translating data into Spanish; v) semiautomatic alignment of translations; and vi) data systematization. This resource is released as an open-access dataset to encourage the academic community to explore the richness of this language.
Anthology ID:
2024.americasnlp-1.14
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
127–131
URL:
https://aclanthology.org/2024.americasnlp-1.14
Cite (ACL):
Antonio Reyes and Hamlet Antonio García. 2024. From Field Linguistics to NLP: Creating a curated dataset in Amuzgo language. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 127–131, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
From Field Linguistics to NLP: Creating a curated dataset in Amuzgo language (Reyes & García, AmericasNLP-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.americasnlp-1.14.pdf
Supplementary material:
 2024.americasnlp-1.14.SupplementaryMaterial.zip