Workshop on the Use of Computational Methods in the Study of Endangered Languages (2025)

Volumes

Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages (23 papers)


Formalizing the Morphology of Rromani Adjectives
Masako Watabe | Max Silberztein

This paper presents a set of linguistic resources that formalizes the morphological behavior of simple Rromani adjectives. We describe the formalization of the adjectives’ morphology and the implementation with the NooJ linguistic platform of an electronic dictionary associated with a formal morpho-syntactic grammar. We can then apply this set of resources to a corpus to evaluate the resources and automatically annotate adjectival forms in Rromani texts. The final set of resources can then be used to identify each Rromani dialectal variant and can be used as a pedagogical tool to teach Rromani as a second language.

Bilingual Sentence Mining for Low-Resource Languages: a Case Study on Upper and Lower Sorbian
Shu Okabe | Alexander Fraser

Parallel sentence mining is crucial for downstream tasks such as Machine Translation, especially for low-resource languages, where such resources are scarce. In this context, we apply a pipeline approach with contextual embeddings on two endangered Slavic languages spoken in Germany, Upper and Lower Sorbian, to evaluate mining quality. To this end, we compare off-the-shelf multilingual language models and word encoders pre-trained on Upper Sorbian to understand their impact on sentence mining. Moreover, to filter out irrelevant pairs, we experiment with a post-processing of mined sentences through an unsupervised word aligner based on word embeddings. We observe the usefulness of additional pre-training in Upper Sorbian, which leads to direct improvements when mining the same language but also its related language, Lower Sorbian.

Citizen linguists and decolonial lexicography: Co-creative dictionary-building in grassroots digital language documentation
Anna Luisa Daigneault | Gregory Anderson

Many endangered, under-represented, minority and Indigenous language communities around the world need access to multilingual online resources to survive in the digital age. The Living Dictionaries platform provides a collaborative online space for professional linguists and citizen-linguists alike to produce their own grassroots digital dictionaries that include multimedia such as audio recordings and images. These online lexica can play an important role in assisting present and future generations in combatting language loss and creating visibility for their languages and cultures on the Internet.

The SENCOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENCOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENCOTEN language documentation.
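
As a rough illustration of the two metrics reported above, the following sketch computes WER and CER as Levenshtein distance divided by reference length. The function names and the convention of counting spaces as characters in CER are assumptions made for illustration; the authors' scoring tool and text normalization are not described in the abstract and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because a single wrong character makes a whole word count as an error, WER is usually much higher than CER for long, morphologically complex words, which is consistent with the gap between the two figures reported in the abstract.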

Speech Technologies with Fieldwork Recordings: the Case of Haitian Creole
William N. Havard | Renauld Govain | Benjamin Lecouteux | Emmanuel Schang

We use 40-year-old digitalised tape-recorded fieldwork data in Haitian Creole to train a native self-supervised learning (SSL) model of speech representation (WAV2VEC2). We also use a continued pre-training approach on pre-trained SSL models of two foreign languages: the lexifier language, French, and an unrelated language, English. We compare the performance of these three SSL models, and of two other foreign SSL models fine-tuned directly, on an ASR task, where all five models are fine-tuned on transcribed fieldwork recordings in Haitian Creole. Our results show that the best-performing model is the one trained using a continued pre-training approach on the lexifier language, followed by the native model. We conclude that the ‘mobilising the archive’ approach advocated by Bird (2020) is a promising way forward to design speech technologies for new languages.

Evaluating Indigenous language speech synthesis for education: A participatory design workshop on Ojibwe text-to-speech
Viann Sum Yat Chan | Christopher Hammerly

This paper reports methods and results from a participatory design workshop aimed at evaluating the use of speech synthesis and text-to-speech for Ojibwe language education. Using an existing text-to-speech feature as a starting point, we worked with two groups of Ojibwe language instructors using a guided trial of the speech synthesis system and a two-hour semi-structured workshop with the aim of creating a lesson plan that utilizes text-to-speech. We highlight the insights from this work, both in how to design and deliver speech synthesis systems for Indigenous language education and in how to approach and design such a workshop to ensure a fruitful discourse.

Approximate search is a valuable component of online dictionaries for learners, allowing them to find words even when they have not fully mastered the orthography or cannot reliably perceive phonemic differences in the language. However, evaluating the performance of different approximate search algorithms remains difficult in the absence of real user queries. We detail several methods for generating synthetic queries representing various user personas. We then compare the performance of several search algorithms on both real and synthetic queries in two Indigenous languages, SENĆOŦEN and Michif, that are phonologically and morphologically very different from English.

Exploring Limitations and Risks of LLM-Based Grammatical Error Correction for Indigenous Languages
Flammie A Pirinen | Linda Wiechetek

Rule-based grammatical error correction has long been seen as the most effective way to create user-friendly end-user systems for grammatical error correction (GEC). However, in recent years, large language models and generative AI systems based on that technology have progressed quickly enough to challenge the traditional GEC approach. In this article we show which possibilities and limitations this approach bears for Indigenous languages that have a more limited digital presence in large language model data and a different literacy background than English. We show experiments in North Sámi, an Indigenous language of Northern Europe.

The expansion of the speech technology sector has given rise to a novel economic model in language research, with the objective of developing speech datasets. This model is expanding to under-served African languages through collaborative efforts between industries, organisations, and the active participation of communities. This collaboration is yielding new datasets for machine learning, while also disclosing vulnerabilities and sociolinguistic discrepancies between industrialised and non-industrialised societies. A case study of a speech data collection camp that took place in September 2024 in Cameroon, involving representatives of 31 languages throughout the continent, illustrates both the prospects of the new economic model for research on under-served languages and the challenges of fair, effective, and responsible participation.

Towards a Hän morphological transducer
Maura O’Leary | Joseph Lukner | Finn Verdonk | Willem de Reuse | Jonathan Washington

This paper presents work towards a morphological transducer for Hän, a Dene language spoken in Alaska and the Yukon Territory. We present the implementation of several complex morphological features of Dene languages into a morphological transducer, an evaluation of the transducer on corpus data, and a discussion of the future uses of such a transducer towards Hän revitalization efforts.

Multilingual MFA: Forced Alignment on Low-Resource Related Languages
Alessio Tosolini | Claire Bowern

We compare the outcomes of multilingual and crosslingual training for related and unrelated Australian languages with similar phonological inventories. We use the Montreal Forced Aligner to train acoustic models from scratch and adapt a large English model, evaluating results against seen data, unseen data (seen language), and unseen data and language. Results indicate benefits of adapting the English baseline model for previously unseen languages.

Creating an intelligent dictionary of Tsuut’ina one verb at a time
Christopher Cox | Bruce Starlight | Janelle Crane-Starlight | Hanna Big Crow | Antti Arppe

In this paper, we discuss the development of a long-term partnership between community and university-based language workers to create supportive language technologies for Tsuut’ina, a critically endangered Dene language spoken in southern Alberta, Canada. Initial development activities in this partnership sought to rapidly integrate existing language materials, with the aim of arriving at tools that would be effective and impactful for community use by virtue of their extensive lexical coverage. We describe how, as this partnership developed, this approach was gradually superseded by one that involved a more targeted, lexical-item-by-lexical-item review process that was directly informed by other community language priorities and connected to the work of a local language authority. We describe how this shift in processes correlated with other changes in local language programs and priorities, noting how ongoing communication allowed this partnership to adapt to the evolving needs of local organizations.

AILLA-OCR: A First Textual and Structural Post-OCR Dataset for 8 Indigenous Languages of Latin America
Milind Agarwal | Antonios Anastasopoulos

It is by now common knowledge in the NLP community that low-resource languages need large-scale data creation efforts and novel contributions in the form of robust algorithms that work in data-scarce settings. Amongst these languages, however, many have a large amount of data, ripe for NLP applications, except that this data exists in image-based formats. This includes scanned copies of extremely valuable dictionaries, linguistic field notes, children’s stories, plays, and other textual material. To extract the text data from these non-machine-readable images, Optical Character Recognition (OCR) is the most popular technique, but it has proven to be challenging for low-resource languages because of their unique properties (uncommon diacritics, rare words etc.) and due to a general lack of preserved page-structure in the OCR output. So, to contribute to the reduction of these two big bottlenecks (lack of text data and layout quality), we release the first textual and structural OCR dataset for 8 indigenous languages of Latin America. We hope that our dataset will encourage researchers within the NLP and Computational Linguistics communities to work with these languages.

Connecting Automated Speech Recognition to Transcription Practices
Blaine Billings | Bradley McDonnell

One of the greatest issues facing documentary linguists is the transcription bottleneck. While the large quantity of audio and video data generated as part of a documentary project serves as a long-lasting record of the language, without corresponding text transcriptions, it remains largely inaccessible for revitalization efforts and linguistic analysis. Automated Speech Recognition (ASR) is frequently proposed as the solution to this problem. However, two issues often prevent documentary linguists from making use of ASR models: 1) the thought that the typical documentary project does not have sufficient data to develop an adequate ASR model and 2) the thought that correcting the output of an ASR model would be more time-consuming for transcribers than simply creating a transcription from scratch. In this paper, we tackle both of these issues by developing an ASR model in the larger context of a documentation project for Nasal, a low-resource language of western Indonesia. Fine-tuning a larger pre-trained language model on 25 hours of transcribed Nasal speech, we produce a model that has a 44% word error rate. Despite this relatively high error rate, tests comparing speed of transcribing from scratch and correcting ASR-generated transcripts show that the ASR model can significantly speed up the transcription process.

Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak’wala Legacy Texts
Milind Agarwal | Antonios Anastasopoulos | Daisy Rosenblum

Kwak’wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak’wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak’wala text, along with post-correction models, to produce a final high-quality transcription.

This paper describes the process and learning outcomes of a three-day workshop on machine learning basics for documentary linguists. During this workshop, two groups of linguists working with two Indigenous languages of North America, Blackfoot and Dënë Sųłıné, became acquainted with machine learning principles, explored how machine learning can be used in data processing for under-resourced languages and then applied different machine learning methods for automatic morphological interlinearization and part-of-speech tagging. As a result, participants discovered paths to greater collaboration between computer science and documentary linguistics and reflected on how linguists might be enabled to apply machine learning with less dependence on experts.

Universal Dependencies for Amahuaca
Candy Angulo | Pilar Valenzuela | Roberto Zariquiey

This paper presents the creation of a Universal Dependencies (UD) treebank for Amahuaca (Peru), marking the first UD treebank within the Headwaters subbranch of the Panoan family, spoken mostly in Peru and Brazil. While the UD guidelines provided a general framework for our annotations, language-specific decisions were necessary due to the rich morphology of the Amahuaca language. The paper also describes specific constructions to initiate a discussion on several general UD annotation guidelines, particularly those concerning clitics and morpheme-level dependencies.

Data augmentation for low-resource bilingual ASR from Tira linguistic elicitation using Whisper
Mark Simmons

This paper explores finetuning Whisper for transcribing audio from linguistic elicitation of Tira, a Heiban language of Sudan. Audio originates from linguistic fieldwork and is bilingual in English and Tira. We finetune Whisper large-v3 using hand-labeled Tira audio and evaluate the resulting model on bilingual audio. We show that Whisper exhibits catastrophic forgetting of English after only a small amount of training, but that including automatically annotated English spans of audio in the training data dramatically reduces catastrophic forgetting of English while largely preserving ASR performance on monolingual Tira audio. This work is relevant to the study of automatic speech recognition for under-resourced languages and for contexts of bilingualism in a high-resourced and a low-resourced language.

Integrating diverse corpora for training an endangered language machine translation system
Hunter Scheppat | Joshua Hartshorne | Dylan Leddy | Eric Le Ferrand | Emily Prudhommeaux

Machine translation (MT) can be a useful technology for language documentation and for promoting language use in endangered language communities. Few endangered languages, however, have an existing parallel corpus large enough to train a reasonable MT model. In this paper, we re-purpose a wide range of diverse data sources containing Amis, English, and Mandarin text to serve as parallel corpora for training MT systems for Amis, one of the Indigenous languages of Taiwan. To supplement the small amount of Amis-English data, we produce synthetic Amis-English data by using a high-quality MT system to generate English translations for the Mandarin side of the Amis-Mandarin corpus. Using two popular neural MT systems, OpenNMT and NLLB, we train models to translate between English and Amis, and Mandarin and Amis. We find that including synthetic data is helpful only when translating to English. In addition, we observe that neither MT architecture is consistently superior to the other and that performance seems to vary according to the direction of translation and the amount of data used. These results indicate that MT is possible for an under-resourced language even without a formally prepared parallel corpus, but multiple training methods should be explored to produce optimal results.

Comparing efficacy of IPA vs Pinyin romanisation transcriptions for complex tonal languages: A case study in Baima
Katia Chirkova | Rolando Coto-Solano | Rachael Griffiths | Marieke Meelen

How is automated tone transcription affected by the choice of transcription orthography? In this paper we present a range of experiments that indicate that, even when the tonal representations are kept the same, the way vowels and consonants are transcribed can affect tonal character outputs. Our results also indicate that using a Language Model (LM) for decoding can mitigate problems with tonal outputs, but tones remain the most difficult part of the transcription. In doing this we also present the first Automatic Speech Recognition (ASR) models for the Baima language, spoken in Sichuan and Gansu, China. We hope to use these models to contribute to ongoing documentation efforts.

Kuene: A Web Platform for Facilitating Hawaiian Word Neologism
Sunny Walker | Winston Wu | Bruce Torres Fischer | Larry Kimura

This paper presents Kuene, a web-based collaborative dictionary editing platform designed to facilitate the creation and publication of Hawaiian neologisms by the Hawaiian Lexicon Committee. Through Kuene, the Committee can create, edit, and refine new dictionary entries with a multi-round approval process, ensuring accuracy and consistency. The platform’s technical features enable flexible access control, fine-grained approval states, and support for multimedia content and AI-assisted orthography modernization. Just in the past two months, Kuene has enabled the publication of over 400 new Hawaiian words. By streamlining the dictionary editing process, Kuene aims to alleviate the scarcity of modern Hawaiian words and facilitate the revitalization efforts of the Hawaiian

Evaluation of Morphological Segmentation Methods for Hupa
Nathaniel Parkes | Zoey Liu

Building downstream NLP applications with tokenization systems built on morphological segmentation has been shown to be fruitful for certain morphologically-rich languages. Yet, indigenous and endangered languages, which tend to be highly polysynthetic and are thereby potential beneficiaries of this approach, pose additional difficulties in their limited access to annotated data for morphological segmentation tasks. In this study, we develop morphological segmentation models for Hupa, a critically endangered Dene/Athabaskan language of North America. With a total of 595 word types, we seek to identify an optimal morphological segmentation model and illustrate how those tested perform under different levels of training data limitation. We propose a simple method that casts morphological segmentation as a sequence binary classification task. While this approach does not outperform the established practice of multi-class classification, it outperforms neural alternatives. This work is intended to act as a starting point for future technological developments with Hupa that look to leverage its morphological qualities, which we hope can serve as a reflection for work with other indigenous languages being studied under similar constraints.
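
To make the idea of casting segmentation as sequence binary classification more concrete, here is a minimal sketch of one common encoding: each character receives label 1 if a morpheme boundary follows it and 0 otherwise. The function names, the hyphen separator, and the English example word are assumptions for illustration only; the abstract does not specify the authors' exact label scheme, features, or classifier, and no Hupa data is shown here.

```python
def to_boundary_labels(segmented, sep="-"):
    """Turn a segmented word such as 'un-break-able' into (word, labels).

    labels[i] is 1 if a morpheme boundary follows character i, else 0.
    """
    morphs = segmented.split(sep)
    labels = []
    for k, morph in enumerate(morphs):
        labels.extend([0] * len(morph))
        # Mark the last character of every non-final, non-empty morpheme.
        if morph and k < len(morphs) - 1:
            labels[-1] = 1
    return "".join(morphs), labels


def from_boundary_labels(word, labels, sep="-"):
    """Rebuild the segmented form from a word and its per-character labels."""
    pieces = []
    for char, label in zip(word, labels):
        pieces.append(char)
        if label == 1:
            pieces.append(sep)
    return "".join(pieces)
```

Under this encoding a classifier only has to make one yes/no decision per character, and its predictions can be mapped straight back to a segmented word with `from_boundary_labels`.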