Daisy Li
2026
Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori
Rolando Coto-Solano | Daisy Li | Manoela Teleginski Ferraz | Olivia Sasse | Cha Krupka | Sharid Loaiciga | Sally Akevai Tenamu Nicholas
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Rolando Coto-Solano | Daisy Li | Manoela Teleginski Ferraz | Olivia Sasse | Cha Krupka | Sharid Loaiciga | Sally Akevai Tenamu Nicholas
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present experiments on diacritic restoration, a form of text normalization essential for creating and processing data in natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resource contexts.
2024
LexiVault: A Repository for Psycholinguistic Lexicons of Lesser-studied Languages
Hind Saddiki | Samantha Wray | Daisy Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Hind Saddiki | Samantha Wray | Daisy Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents LexiVault, an open-source web tool with annotated lexicons and rich retrieval capabilities primarily developed for, but not restricted to, the support of psycholinguistic research with key measures to design stimuli for low-resource languages. Psycholinguistic research relies on human responses to carefully crafted stimuli for a better understanding of the mechanisms by which we learn, store and process language. Stimuli design captures specific language properties such as frequency, morphological complexity, or stem likelihood in a part of speech, typically derived from a corpus that is representative of the average speaker’s linguistic experience. These measures are more readily available for well-resourced languages, whereas efforts for lesser-studied languages come with substantial overhead for the researcher to build corpora and calculate these measures from scratch. This stumbling block widens the gap, further skewing our modeling of the mental architecture of linguistic processing towards a small, over-represented set of the world’s languages. To lessen this burden, we designed LexiVault to be user friendly and accommodate incremental growth of new and existing low-resource language lexicons in the system through moderated community contributions while abstracting programming complexity to foster more interest from the psycholinguistics community in exploring low-resource languages.