Martha Lorena Avendaño Garrido


2026

In this paper, we aim to answer the following question: could corpus duplication be useful in Natural Language Processing (NLP) for low-resource languages? In these languages (or pi-languages), corpora available for training Large Language Models are virtually non-existent. Specifically, we study the impact of corpus expansion in Nahuatl, an agglutinative and polysynthetic Amerindian pi-language characterised by extensive dialectal variation. Our goal is to increase the size of Nahuatl corpora, which currently consist of a limited number of tokens, through controlled duplication techniques. Our experimental setup employs incremental duplication alongside appropriate corpus balancing, with the objective of training embeddings optimised for downstream NLP tasks. Consequently, static embeddings were trained and evaluated on a sentence-level semantic similarity task. Our results show a significant improvement in performance when incremental duplication is applied, compared to results obtained without corpus expansion. To our knowledge, this technique has not yet been explored in this field.

2025

π-YALLI : a new corpus for Nahuatl Language Models The Nahuatl is a language with few computational resources, despite the fact that it is a living language spoken by around two million people. We built π-YALLI, a corpus that enables research and development of dynamic and static Language Models (LM). We measured the perplexity of π-YALLI, evaluating state-of-the-art LM performance on a manually annotated semantic similarity corpus relative to annotator agreement. The results show the difficulty of working with this π-language, but at the same time open up interesting perspectives for the study of other NLP tasks on Nahuatl.