Corpora duplication for NLP in low-resource languages: A case study of Nahuatl
Juan Jose Guzman Landa, Juan-Manuel Torres-Moreno, Luis Moreno Jimenez, Elvys Linhares Pontes, Miguel Figueroa-Saavedra, Graham Ranger, Martha Lorena Avendaño Garrido
Abstract
In this paper, we aim to answer the following question: could corpus duplication be useful in Natural Language Processing (NLP) for low-resource languages? In these languages (or pi-languages), corpora available for training Large Language Models are virtually non-existent. Specifically, we study the impact of corpus expansion in Nahuatl, an agglutinative and polysynthetic Amerindian pi-language characterised by extensive dialectal variation. Our goal is to increase the size of Nahuatl corpora, which currently consist of a limited number of tokens, through controlled duplication techniques. Our experimental setup employs incremental duplication alongside appropriate corpus balancing, with the objective of training embeddings optimised for downstream NLP tasks. Consequently, static embeddings were trained and evaluated on a sentence-level semantic similarity task. Our results show a significant improvement in performance when incremental duplication is applied, compared to results obtained without corpus expansion. To our knowledge, this technique has not yet been explored in this field.
- Anthology ID:
- 2026.americasnlp-6.11
- Volume:
- Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
- Venues:
- AmericasNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 115–127
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.11/
- Cite (ACL):
- Juan Jose Guzman Landa, Juan-Manuel Torres-Moreno, Luis Moreno Jimenez, Elvys Linhares Pontes, Miguel Figueroa-Saavedra, Graham Ranger, and Martha Lorena Avendaño Garrido. 2026. Corpora duplication for NLP in low-resource languages: A case study of Nahuatl. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 115–127, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Corpora duplication for NLP in low-resource languages: A case study of Nahuatl (Guzman Landa et al., AmericasNLP 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.11.pdf