A data-centric approach to performance improvement in under-resourced ASR: The case of Dënë Sųłıné

Olga Kriukova; Olga Lovick; Antti Arppe

Abstract

This paper presents a study focused on advancing Automatic Speech Recognition (ASR) for the under-resourced language Dënë Sųłıné through data-centric approaches. We explore multiple strategies to enhance the quality of training data—both audio recordings and transcriptions—to address the challenges posed by mixed-quality datasets. Our experiments investigate which data preparation techniques most effectively improve ASR performance in this context. Our findings show that reducing non-phonemic spelling variation in the corpus significantly improves model generalization, resulting in a substantial increase in recognition accuracy. Additionally, we demonstrate that increasing manually reviewed transcriptions consistently improves word and character error rates, while audio enhancement slightly reduces performance, highlighting the complex trade-offs in low-resource ASR development.

Anthology ID: 2026.americasnlp-6.9
Volume: Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues: AmericasNLP | WS
Publisher: Association for Computational Linguistics
Pages: 95–106
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.9/
Cite (ACL): Olga Kriukova, Olga Lovick, and Antti Arppe. 2026. A data-centric approach to performance improvement in under-resourced ASR: The case of Dënë Sųłıné. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 95–106, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): A data-centric approach to performance improvement in under-resourced ASR: The case of Dënë Sųłıné (Kriukova et al., AmericasNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.9.pdf
