The Power of Simplicity: N-Grams and Transformers in Nahuatl Language Identification

Luis Mercado Campos, Robert Pugh, Alexis Palmer


Abstract
In real-world language technology applications, the language or variety in which a given text is written is often unknown or uncertain. Yet this information is crucial for selecting and applying appropriate models or resources. Language identification (LID), the process of determining the language or variety of a text sample, is thus often an important foundational task in natural language processing. LID can be particularly challenging when (1) few labeled texts are available for training, and (2) similar or related languages are involved, since these may share many surface-level features. In this paper, we present an LID system for Nahuatl, a group of closely related language varieties spoken in Mexico and Central America. Nahuatl LID involves both of these challenges: Nahuatl varieties can be quite similar, sharing morphemes and even many lexical items, and there is a relative paucity of representative, variant-labeled Nahuatl text. We describe LID experiments for a total of 11 Nahuatl varieties, achieving generally good results (90.59% ±0.09% in 5-fold cross-validation experiments). Many of the remaining errors result from confusion between three highly similar Huasteca variants.
Anthology ID:
2026.americasnlp-6.14
Volume:
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
153–167
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.14/
Cite (ACL):
Luis Mercado Campos, Robert Pugh, and Alexis Palmer. 2026. The Power of Simplicity: N-Grams and Transformers in Nahuatl Language Identification. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 153–167, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
The Power of Simplicity: N-Grams and Transformers in Nahuatl Language Identification (Mercado Campos et al., AmericasNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.14.pdf