Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9)
Godfred Agyapong, Sarah Moeller, Antti Arppe, Ali Marashian, Daisy Rosenblum (Editors)
- Anthology ID:
- 2026.computel-1
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Venues:
- ComputEL | WS
- Events:
- Annual Meeting of the Association for Computational Linguistics (2026) | Workshop on the Use of Computational Methods in the Study of Endangered Languages (2026) | Other Workshops and Events (2026)
- SIG:
- Publisher:
- Association for Computational Linguistics
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.computel-1/
- DOI:
- ISBN:
- 979-8-89176-422-4
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.computel-1.pdf
Morphological Parsing for Media Lengua: When Accessibility Matters More Than State-of-the-Art
Jesse Stewart | Olga Kriukova
While machine learning approaches dominate contemporary NLP research, a critical gap exists between published models and tools actually used by target communities (Gessler & von der Wense, 2024). This paper presents two morphological parsers for Media Lengua (ISO 639-3: mue), an endangered mixed language of Ecuador, demonstrating that a JavaScript rule-based system (98.6% accuracy) can outperform a CRF model (95.7% F1) while offering immediate community accessibility. Not all language structures permit straightforward rule-based parsing; however, when a language’s morphology allows for this approach with competitive accuracy, we argue that it should be preferred for its practical advantages: immediate browser-based deployment, transparency, zero infrastructure requirements, and long-term maintainability. Our rule-based parser runs entirely in the browser, is freely available online, and can be adapted to other Quechuan languages. In contrast, while the CRF model performs well on benchmarks, it requires additional infrastructure to become accessible. Our comparison highlights the need to evaluate NLP tools not only on accuracy metrics but also on accessibility and real-world adoption, which is particularly crucial for endangered language communities where sustainable, community-accessible tools can support language documentation, education, and revitalization.
Speech Recognition and Synthesis Technologies Applied to Preservation and Revitalization of the Ainu Language
Tatsuya Kawahara | Kohei Matsuura
This paper gives an overview of our activities in developing automatic speech recognition (ASR) and text-to-speech (TTS) systems for the preservation and revitalization of the Ainu language, once spoken in the Hokkaido area of Japan and now listed as "severely endangered". With a large pretrained model, a high-performing ASR system can be trained even with five hours of speech from a few speakers. It has been used to streamline the transcription and archiving of old recordings. A TTS system has also been developed and is used to revitalize the speech of old folktales whose audio is missing. It is also used to provide a reference for speaking practice for new Ainu speakers. Speech technologies are important for endangered languages because their cultures have typically been passed down orally, and our efforts will be useful for passing them on to the future.
Choosing an ASR model for Dënë Sųłıné: Navigating polysynthesis and unstandardized orthography
Olga Kriukova | Antti Arppe | Olga Lovick
While several pre-trained multilingual models are actively used for fine-tuning on under-resourced and endangered languages, it remains unclear which architectures perform better and what factors explain their varying performance across languages. Although this question may be less pressing for languages with adequate resources, it is critical for endangered language communities, where the time and funding available to experiment with multiple model options are limited (Jimerson et al., 2023). We compare the performance of two ASR architectures, Wav2Vec2 and Whisper, on a Dënë Sųłıné dataset. This language and dataset present several challenges common to under-resourced and endangered languages: unstandardized orthography, pronunciation variation, and phonological and morphosyntactic structures that differ from the major languages represented in the multilingual datasets used for pre-training large ASR models. Although Wav2Vec2 reportedly outperforms Whisper in low-resource settings (see, e.g., Coto-Solano et al., 2024; Nahabwe et al., 2025; Williams et al., 2023), our study shows that Whisper yields significantly better results on the Dënë Sųłıné dataset. These findings suggest that model performance may depend not only on architecture, dataset size, or typological features of the language, but also on dataset-specific characteristics. In our case, Whisper showed better adaptability to a dataset with inconsistent spelling and pronunciation. Further verification across similarly inconsistent datasets is required to assess the generalizability of this result.
An Interactive System for Generating Revisable Grammar Lessons for Extremely Low-Resource Languages Without Expert Annotation
Sebastien Christian
Endangered-language teaching often faces two practical bottlenecks: the scarcity of experts able to produce pedagogical grammars, and the dependence of most approaches on expert linguistic annotation. We present a human-in-the-loop system for extremely low-resource languages that addresses both constraints by combining lightweight concept-based annotation, typological inference, structured sentence-pair augmentation, document retrieval, and constrained language model generation. Rather than aiming to produce definitive grammatical descriptions, the system generates revisable grammar lesson drafts grounded in heterogeneous evidence, including elicited sentence pairs, free translation pairs, and descriptive documents. The interface is designed so that speakers, teachers, and other language practitioners without formal linguistic training can contribute usable data, inspect intermediate inferences, control source selection and generate draft grammar lessons. We describe the architecture, user workflows, and initial deployment experience in real-world revitalization settings. The contribution of the paper is an implemented workflow for early pedagogical draft generation under extreme data scarcity, not a controlled evaluation of pedagogical effectiveness.
Voices from the Margins: Modeling Linguistic Diversity in Spontaneous Speech for Low-Resource Languages
Vitthal Bhandari | Tiya Kumar | Katharine Mulhern
We conduct automatic speech recognition (ASR) experiments on the Common Voice Spontaneous Speech dataset by Mozilla Data Collective, which covers 21 low-resource languages across four continents. We fine-tune popular multilingual speech models on all languages of this dataset and observe that, while no single best model exists, the Massively Multilingual Speech model and Whisper achieve superior performance on certain languages. In decoding experiments with n-gram language models, we observe an improvement in error rate over greedy decoding of up to 27.3%. We follow our experiments with a close linguistic error analysis of the best-performing models on Scots (sco) and Nubi (kcn), two languages in our dataset with very little prior audio and text modeling research. We highlight the morphosyntactic errors induced during speech recognition and perform a holistic analysis of these languages. Finally, we advocate for the importance of building efficient and accurate ASR tools for modeling speech in endangered languages with scarce resources, and for their applications to language revitalization, language learning assistance, and accessibility.
Digital posters: Publishing Gurindji plant and animal poster content as websites using an open-source template-based RO-Crate preview tool
Ben Foley | Abigail Davis | Felicity Meakins
Bringing together Gurindji language material from an award-winning poster series and an existing website tool, our work demonstrates the benefits arising from packaging existing language material according to the RO-Crate standard. We describe a relatively fast, low-cost, low-maintenance and long-lasting method of publishing language content online with data in RO-Crate format. The production leverages the prior work done in collating content, requiring minimal further work to reformat and republish for online publication. Four websites were built using this method.
AvarLab: An Integrated Digital Ecosystem for Avar, a Morphologically Rich Low-Resource Language
Kebed Zagidov | Thomas Brochhagen
This paper presents a digital ecosystem designed for Avar, a morphologically rich and vulnerable Northeast Caucasian language. Addressing the common bottleneck where lexical resources, corpora, and computational tools are developed in isolation or are entirely absent, we propose the "generate-verify" workflow. By developing a scalable, rule-based computational architecture, our system specifically targets the challenges of low-resource settings, overcoming data sparsity to generate over one million inflected forms from a static dictionary of 14,700 entries. Furthermore, by coupling morphological generation with corpus verification, we introduce a dynamic method to rapidly analyze and expand endangered language data. This approach transforms static linguistic documentation into active language reclamation tools, supporting dictionary lookup and the creation of silver-standard annotations for downstream NLP. The platform also serves as a unified model for the collection, management, and mobilization of fragmented language data, ensuring that the resulting resources are directly accessible and beneficial to the speaker community. Ultimately, AvarLab provides a practical, adaptable pathway for building sustainable digital infrastructure by fostering interaction among documentary linguists, computer scientists, and native speakers.
Revitalising Endangered Languages and Cultural Heritage through Language Technology: A Pilot Study for Dzardzongke
Hannah Claus | Songbo Hu | Emre Isik | Anna Korhonen | Kitty Liu | Marieke Meelen
In this short paper, we present the first prototype of a mobile application to help preserve and revitalise the endangered language and cultural heritage of the speakers of Dzardzongke, a Tibetic language spoken in South Mustang, Nepal. With this pilot study, we provide a collaborative and highly accessible solution to revitalisation that has potential for any community interested in preserving their language and culture.
Annotation Tools for Language Documentation: A Survey of Capabilities, Gaps, and Morphological Support
Changbing Yang | Pt Anderson | Godfred Agyapong | Sarah Moeller
Annotation tools are foundational infrastructure for language documentation, yet few comprehensive surveys have evaluated the tool landscape specifically from a documentary linguistics perspective. We survey 98 annotation tools across dimensions critical to language documentation workflows: annotation support, collaboration features, active learning, cost and openness, and institutional sustainability. Of the 44 tools both free and accessible for evaluation, only 15 support morpheme segmentation and glossing, and only 6 combine morphological annotation with remote collaboration at no cost. We identify a structural gap between the current tools and the requirements of field linguists working with endangered and Indigenous languages. While many NLP tools prioritize scalable annotation for high-resource settings, documentary linguists need interlinear glossed text (IGT) support and community-accessible interfaces. We taxonomise the tool landscape, present a multi-dimensional feature matrix, suggest current tools for language documentation, and conclude with concrete recommendations for tool developers and the documentary linguistics community.
Addressing Domain Mismatch in ASR for Akuzipik Language Documentation
Summer Chambers | Sylvia Woodrose Schwartz | Matthew Kelley | Lane Woodrose Schwartz
The use of ASR models in endangered language documentation has grown in popularity given the bottleneck of manual speech transcription. Meta’s Massively Multilingual Speech (MMS) model is particularly popular for its extensibility to low-resource languages. However, it is mostly trained on read speech data from the Bible, meaning it may not perform well on other domains. We evaluated this model on data collected as part of a larger language documentation and revitalization project focused on Akuzipik, a polysynthetic Alaska Native language. We also finetuned and evaluated the model on a small (1h) collection of speech. The original model performed well on a dataset that roughly matched the Bible training data in domain and writing style but struggled on a separate collection of spontaneous speech. Performance on spontaneous speech improved after finetuning on a sample of our full dataset, and error rates reduced less dramatically after finetuning only on read speech. Both finetuning scenarios show promise for future model improvement, especially considering the relative ease of collecting read speech data. This experiment confirms the challenge of transcribing spontaneous speech with the MMS ASR model but provides hope for improving model performance for language documentation purposes, even with scarce data.
This paper investigates the challenges of low-resource machine translation for ʻŌlelo Hawaiʻi (Hawaiian), a critically endangered Polynesian language. We compile a corpus of publicly available Hawaiian-English bitext and investigate the effectiveness of neural sequence-to-sequence models and large language models for translating Hawaiian. To address data scarcity, we employ various data augmentation techniques, including backtranslation, multilingual training using parallel corpora in related languages, and leveraging dictionary entries. Our experiments demonstrate that multilingual training significantly improves model performance, particularly when incorporating bitext from related Polynesian languages. Fine-tuned large language models were not able to outperform mBART, highlighting that smaller and simpler models are still relevant, especially in low-resource scenarios.
Creole languages emerged from colonial contact and the slave trade. Although they inherit the bulk of their vocabulary from a "lexifier" language, they remain classic low-resource languages, presenting significant challenges for speech technology. This paper explores how the abundant resources of a lexifier can be leveraged for Creole-specific tools, focusing on Automatic Speech Recognition (ASR). Specifically, we use an artificial dataset generated by a French-trained Text-to-Speech (TTS) model and French datasets to pre-finetune ASR models for two French-based Creoles. Our results demonstrate that a two-stage training setup where models are first trained on artificial datasets leads to a substantial performance boost for transcribing Creole languages. Additionally, this approach serves as a viable first step for ASR development in zero-resource scenarios.
Indigenous Writing Systems Matter: Rethinking NLP beyond Alphabetic Bias through Script-Aware Modeling
Ngoc Tan Le | Mamady Traore | Cristian Ahumada Oliva | Fatiha Sadat
Natural Language Processing (NLP) has made significant progress in recent years, largely driven by large-scale pretrained models and vast textual and multimodal corpora. However, these advances remain unevenly distributed, disproportionately benefiting high-resource languages while Indigenous and endangered languages, especially those employing diverse and less widely supported writing systems, remain underrepresented. This paper examines the role of writing system diversity in NLP, with a focus on Indigenous and endangered languages. We propose a theoretical framework that accounts for variation across writing systems and its implications for computational modeling. Specifically, we (i) provide an overview of writing system diversity, (ii) synthesize available computational resources, and (iii) present a structured analysis of challenges in modeling, tokenization, and evaluation. Our analysis shows that writing system diversity reveals structural biases embedded in current NLP pipelines. We conclude by identifying key open challenges and outlining directions for future research toward more inclusive, script-aware NLP approaches that better account for writing system variation.
Language archives contain valuable linguistic materials that are undigitized and therefore difficult to access. Modern optical character recognition (OCR) systems have great potential to make these collections more accessible, but few evaluations assess the quality of OCR systems specifically for language archive materials. We present CoRSAL-OCR, an OCR evaluation dataset of over 200 document pages with gold-standard transcriptions from two South Asian languages: Bodo (written in Devanagari) and Garo (written in Latin script). Using this dataset together with the 8-language AILLA-OCR benchmark, we evaluate four OCR systems: Tesseract, Google Cloud Vision, Gemini 3 Flash, and Qwen3.5-27B (an open-weight model). We find that vision language models (VLMs), when given appropriate prompts, achieve the lowest error rates on these datasets. However, prompt design has a large effect on VLM performance, with a detailed generic prompt reducing character error rate (CER) by up to a factor of six compared to a minimal prompt. We release our dataset at https://github.com/larc-iu/corsal-ocr to support further research on OCR for language archives.
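For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between the gold transcription and the OCR output, divided by the length of the gold transcription. The function names are illustrative, and the choices to apply NFC normalization and to count Unicode code points (rather than grapheme clusters, which may matter for Devanagari) are assumptions, not details taken from the paper.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate over NFC-normalized code points (an assumption)."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Because the denominator is the reference length, CER can exceed 1.0 when a system inserts a lot of spurious text.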
The Missing Middle: Language Documentation Needs Better Infrastructure, Not Better Models
Luke Gessler | Antonios Anastasopoulos | Sandra Auderset | Timotheus Bodt | Shobhana Chelliah | Sebastien Christian | Maxime Fily | Santiago Herrera | Eva Huber | Sharid Loaiciga | Marieke Meelen | Robert Östling | Alexis Palmer | Eline Visser
Despite decades of progress in human language technology (HLT) and growing research interest in endangered languages, practical uptake of HLT in documentary linguistics workflows remains rare. In this opinion piece, we report on a structured dialogue among approximately twenty academics convened to diagnose why this gap persists. Across all topics, we identify a recurring structural problem, which we call the missing middle: despite the existence of many potentially useful HLTs, the connective infrastructure necessary to make them genuinely accessible to linguists and language communities does not exist. We report the details of our discussion and make four specific recommendations for how those active in language documentation and HLT research might orient their future work.
Aspects of Selecting the Right ASR Training Languages for Under-Resourced Languages
J. Elizabeth Liebl | Summer Chambers | Matthew Kelley | Géraldine Walther
We investigate how training languages should be selected for cross-lingual IPA ASR on unseen languages. Using Common Voice audio and Vox Communis phonetic transcripts, we train multilingual IPA-based ASR models for Upper Sorbian, Luganda, and Tatar under three linguistically motivated selection strategies: genealogical relatedness, geographic proximity, and phonological inventory overlap. We compare these strategies to a random baseline and evaluate performance with phone error rate. Linguistically informed selection generally improves transfer, but no single strategy is consistently optimal. Geographic proximity performs best for Luganda, phonological overlap is slightly best for Tatar, and none of the proposed strategies outperform random selection for Upper Sorbian. The results suggest that linguistic similarity aids low-resource ASR transfer, but that the most useful dimension of similarity varies by target language.
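One of the three selection strategies, phonological inventory overlap, can be illustrated with a simple set-based score. The sketch below ranks candidate training languages by the Jaccard overlap between their phoneme inventories and the target's. The Jaccard measure, the function names, and the toy inventories (`lang_a`, `lang_b`) are all assumptions made for illustration; the abstract does not specify how the authors quantify overlap.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_overlap(target: set[str],
                    candidates: dict[str, set[str]]) -> list[tuple[str, float]]:
    """Sort candidate languages by descending inventory overlap with the target."""
    scored = [(name, jaccard(target, inv)) for name, inv in candidates.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy inventories, not real language data.
    target = {"p", "t", "k", "a", "i"}
    candidates = {
        "lang_a": {"p", "t", "k", "a", "u"},
        "lang_b": {"b", "d", "a"},
    }
    for name, score in rank_by_overlap(target, candidates):
        print(f"{name}: {score:.2f}")
```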
Bottlenecks of In-Context Learning for Fieldwork ASR: A Case-study of Panãra
Siyu Liang | Myriam Lapierre | Gina-Anne Levow
In-context learning (ICL) enables ASR models to transcribe unseen languages by conditioning on a handful of audio-transcript pairs at inference time, with no fine-tuning. This is appealing for language documentation, where transcribed data is scarce and recording conditions vary across sessions. We evaluate ICL on Panãra (Northern Jê, Brazil), a language with a complex practical orthography in which diacritics encode phonemic contrasts, across seven fieldwork recordings varying in speaker, narrative, and recording context. We find substantial within-language variation in transcription accuracy unexplained by any single recording-level factor, and show that diacritics are a systematic bottleneck with pronounced differences across diacritic types. An orthographic manipulation experiment further shows that how diacritics are represented in context transcriptions substantially affects model performance. These results highlight orthographic complexity and recording-level variation as key practical challenges for ICL-assisted fieldwork transcription.
Developing A Hawaiian Corpus Toolkit for Data-Driven Language Learning
Joseph Winkie | Michol Miller | Winston Wu
This paper presents the development of an online multimodal corpus toolkit designed for data-driven language learning in Hawaiian. The toolkit supports corpus linguistics analyses including concordance/KWIC (Key Word In Context) searches, frequency analysis, collocation analyses, and complex queries with n-grams and regex pattern matching. Specifically designed for educators, students, and parents within the Hawaiian community, this easy-to-use tool facilitates a data-driven language learning process by enabling users to explore authentic language data, identify patterns, and develop deeper understanding of Hawaiian language structures through computational methods. By integrating corpus-based approaches into language education, this toolkit contributes significantly to preserving and promoting Hawaiian language learning and supports the broader community’s efforts in language revitalization.
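As an illustration of the concordance/KWIC search mentioned in this abstract, here is a minimal sketch that returns each occurrence of a keyword with a fixed window of context on either side. It is not the toolkit's implementation: the function name, whitespace tokenization, and case-insensitive matching are assumptions, and an English placeholder sentence is used rather than Hawaiian corpus data.

```python
def kwic(tokens: list[str], keyword: str,
         window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Placeholder text; a real corpus would need language-aware tokenization.
    text = "the cat sat on the mat near the door"
    for left, key, right in kwic(text.split(), "the", window=2):
        print(f"{left:>12} | {key} | {right}")
```

Aligning the keyword in a central column, as the print statement does, is what lets a learner scan the left and right contexts for recurring patterns.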
Voice Activation Detection for Transcription of Indigenous Languages
Rolando Coto-Solano | Mikaela Browning | Thomas Corrado | Sally Akevai Nicholas
Voice Activity Detection (VAD) is the first step in a workflow intended for the automated transcription of Indigenous and low-resource languages. However, VAD’s effectiveness when detecting voices in fieldwork settings remains untested. Fieldwork recordings have very different noise and interference conditions from the datasets that mainstream VAD models have been trained on, and so these models might fail when confronted with this type of linguistic data. This paper tests different algorithms using data from two typologically distinct Indigenous languages: Bribri from Costa Rica and Cook Islands Māori from Polynesia. We compare energy-based methods (PyDub), GMM-based methods (WebRTC VAD), and two neural-network-based methods (Silero and SpeechBrain) against human-annotated transcriptions. Our results indicate that hybrid architectures like that of SpeechBrain obtain the best results (89% accuracy for Bribri and 94% for Cook Islands Māori). However, no system performed well when tagging non-speech segments, which might indicate a bias towards falsely marking the natural noise of a fieldwork setting as voice. With these findings we hope to inform the selection of VAD tools when implementing ASR workflows.
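The simplest family compared here, energy-based VAD, can be sketched in a few lines: split the signal into fixed-length frames, compute each frame's RMS level in decibels, and label a frame as speech when it exceeds a threshold. This is a generic illustration in plain Python and does not use or reproduce the PyDub API; the function names, the 160-sample frame length, and the -40 dB threshold are assumptions.

```python
import math


def frame_levels_db(samples: list[float], frame_len: int) -> list[float]:
    """RMS level of each non-overlapping frame, in dB relative to full scale."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20 * math.log10(max(rms, 1e-10)))  # clamp to avoid log(0)
    return levels


def energy_vad(samples: list[float], frame_len: int = 160,
               threshold_db: float = -40.0) -> list[bool]:
    """Label a frame True (speech) when its level exceeds the threshold."""
    return [level > threshold_db for level in frame_levels_db(samples, frame_len)]


if __name__ == "__main__":
    # Synthetic signal: silence, a 440 Hz tone at 16 kHz, then silence.
    silence = [0.0] * 320
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
    print(energy_vad(silence + tone + silence))
```

A fixed threshold of this kind labels any sufficiently loud frame as speech, including wind or background activity, which is one way a detector can end up over-marking fieldwork noise as voice.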