Addressing Domain Mismatch in ASR for Akuzipik Language Documentation

Summer Chambers, Sylvia Woodrose Schwartz, Matthew Kelley, Lane Woodrose Schwartz


Abstract
The use of ASR models in endangered language documentation has grown in popularity given the bottleneck of manual speech transcription. Meta’s Massively Multilingual Speech (MMS) model is particularly popular for its extensibility to low-resource languages. However, it is trained mostly on read speech data from the Bible, so it may not perform well in other domains. We evaluated this model on data collected as part of a larger language documentation and revitalization project focused on Akuzipik, a polysynthetic Alaska Native language. We also finetuned and evaluated the model on a small (1-hour) collection of speech. The original model performed well on a dataset that roughly matched the Bible training data in domain and writing style but struggled on a separate collection of spontaneous speech. Performance on spontaneous speech improved after finetuning on a sample of our full dataset, and error rates fell less dramatically after finetuning only on read speech. Both finetuning scenarios show promise for future model improvement, especially considering the relative ease of collecting read speech data. This experiment confirms the challenge of transcribing spontaneous speech with the MMS ASR model but provides hope for improving model performance for language documentation purposes, even with scarce data.
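The abstract reports results as error rates without naming the metric on this page. As a minimal sketch, assuming the usual ASR measures of word error rate (WER) and character error rate (CER), the comparison between a reference transcript and a model hypothesis could be computed as follows. The function names and the placeholder strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences,
    counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder strings only (not Akuzipik data).
print(word_error_rate("a b c d", "a x c"))  # one substitution, one deletion
print(char_error_rate("abcd", "abxd"))      # one substitution
```

For a polysynthetic language, where a single word can be long, a single wrong character makes the whole word count as an error under WER, so reporting CER alongside WER is a common choice.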
Anthology ID:
2026.computel-1.10
Volume:
Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Godfred Agyapong, Sarah Moeller, Antti Arppe, Ali Marashian, Daisy Rosenblum
Venues:
ComputEL | WS
Publisher:
Association for Computational Linguistics
Pages:
93–103
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.computel-1.10/
Cite (ACL):
Summer Chambers, Sylvia Woodrose Schwartz, Matthew Kelley, and Lane Woodrose Schwartz. 2026. Addressing Domain Mismatch in ASR for Akuzipik Language Documentation. In Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9), pages 93–103, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Addressing Domain Mismatch in ASR for Akuzipik Language Documentation (Chambers et al., ComputEL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.computel-1.10.pdf
Supplementary material:
2026.computel-1.10.SupplementaryMaterial.txt
Supplementary material:
2026.computel-1.10.SupplementaryMaterial.zip