Abstract
Linguistic variation is a complicating factor for digital language technologies. This is particularly true for languages that lack an official “standard” variety, including many regional and minoritized languages. In this paper, we describe a set of experiments focused on multivariant natural language processing for Nahuatl, an indigenous Mexican language with a high level of linguistic variation and no single recognized standard variant. Using small (10k tokens), recently published annotated datasets for two Nahuatl variants, we compare the performance of single-variant, cross-variant, and joint training, and explore how different models perform on a third Nahuatl variant, unseen in training. These results and the subsequent discussion contribute to efforts to develop low-resource NLP that is robust to diatopic variation. We share all code used to process the data and run the experiments.

- Anthology ID:
- 2024.vardial-1.12
- Volume:
- Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
- Venues:
- VarDial | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 140–151
- URL:
- https://aclanthology.org/2024.vardial-1.12
- Cite (ACL):
- Robert Pugh and Francis Tyers. 2024. Experiments in Multi-Variant Natural Language Processing for Nahuatl. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 140–151, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Experiments in Multi-Variant Natural Language Processing for Nahuatl (Pugh & Tyers, VarDial-WS 2024)
- PDF:
- https://preview.aclanthology.org/fix-volume-bibkeys/2024.vardial-1.12.pdf