Olga Lovick


2026

While several pre-trained multilingual models are actively used for fine-tuning on under-resourced and endangered languages, it remains unclear which architectures perform better and what factors explain their varying performance across languages. Although this question may be less pressing for languages with adequate resources, it is critical for endangered language communities, where little time and funding are available to experiment with multiple model options (Jimerson et al., 2023). We compare the performance of two ASR architectures, Wav2Vec2 and Whisper, on a Dënë Sųłıné dataset. This language and dataset present several challenges common to under-resourced and endangered languages: unstandardized orthography, pronunciation variation, and phonological and morphosyntactic structures that differ from those of the major languages represented in the multilingual datasets used for pre-training large ASR models. Although Wav2Vec2 reportedly outperforms Whisper in low-resource settings (see, e.g., Coto-Solano et al., 2024; Nahabwe et al., 2025; Williams et al., 2023), our study shows that Whisper yields significantly better results on the Dënë Sųłıné dataset. These findings suggest that model performance may depend not only on architecture, dataset size, or typological features of the language, but also on dataset-specific characteristics. In our case, Whisper showed better adaptability to a dataset with inconsistent spelling and pronunciation. Further verification across similarly inconsistent datasets is required to assess the generalizability of this result.
This paper presents a study focused on advancing Automatic Speech Recognition (ASR) for the under-resourced language Dënë Sųłıné through data-centric approaches. We explore multiple strategies to enhance the quality of training data—both audio recordings and transcriptions—to address the challenges posed by mixed-quality datasets. Our experiments investigate which data preparation techniques most effectively improve ASR performance in this context. Our findings show that reducing non-phonemic spelling variation in the corpus significantly improves model generalization, resulting in a substantial increase in recognition accuracy. Additionally, we demonstrate that increasing manually reviewed transcriptions consistently improves word and character error rates, while audio enhancement slightly reduces performance, highlighting the complex trade-offs in low-resource ASR development.
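As an illustration of the two metrics named above, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed from an edit distance between a reference transcription and a model hypothesis. The abstract does not state which evaluation tooling was used, so the function names `edit_distance`, `wer`, and `cer` are hypothetical, and the NFC normalization step is an assumption added because the orthography uses diacritics that can be encoded in more than one way.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text):
    # Assumption: compose characters (NFC) so that precomposed and
    # combining-mark encodings of the same letter compare as equal.
    return unicodedata.normalize("NFC", text)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = _normalize(reference).split()
    hyp_words = _normalize(hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    ref_chars = _normalize(reference)
    hyp_chars = _normalize(hypothesis)
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Under this definition, a single non-phonemic spelling variant in a word counts as a full word error in WER but only as one or two character errors in CER, which is one reason the two metrics can respond differently to spelling normalization.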

2025

This paper describes the process and learning outcomes of a three-day workshop on machine learning basics for documentary linguists. During this workshop, two groups of linguists working with two Indigenous languages of North America, Blackfoot and Dënë Sųłıné, became acquainted with machine learning principles, explored how machine learning can be used in data processing for under-resourced languages and then applied different machine learning methods for automatic morphological interlinearization and parts-of-speech tagging. As a result, participants discovered paths to greater collaboration between computer science and documentary linguistics and reflected on how linguists might be enabled to apply machine learning with less dependence on experts.

2018

2016

This paper describes a repository of example sentences in three endangered Athabascan languages: Koyukon, Upper Tanana, and Lower Tanana. The repository allows researchers or language teachers to browse the example sentence corpus, either to investigate the languages or to prepare teaching materials. The originally heterogeneous text collection was imported into a SOLR store via the POIO bridge. This paper describes the requirements, implementation, advantages and drawbacks of this approach and discusses the potential to apply it to other languages of the Athabascan family or beyond.