Data augmentation for low-resource bilingual ASR from Tira linguistic elicitation using Whisper

Mark Simmons


Abstract
This paper explores finetuning Whisper for transcribing audio from linguistic elicitation of Tira, a Heiban language of Sudan. The audio originates from linguistic fieldwork and is bilingual in English and Tira. We finetune Whisper large-v3 on hand-labeled Tira audio and evaluate the resulting model on bilingual audio. We show that Whisper exhibits catastrophic forgetting of English after only a small amount of training, but that including automatically annotated English spans of audio in the training data dramatically reduces this forgetting while largely preserving ASR performance on monolingual Tira audio. This work is relevant to automatic speech recognition for under-resourced languages and to bilingual settings that pair a high-resource language with a low-resource one.
Anthology ID:
2025.computel-main.18
Volume:
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month:
March
Year:
2025
Address:
Honolulu, Hawaii, USA
Editors:
Jordan Lachler, Godfred Agyapong, Antti Arppe, Sarah Moeller, Aditi Chaudhary, Shruti Rijhwani, Daisy Rosenblum
Venues:
ComputEL | WS
Publisher:
Association for Computational Linguistics
Pages:
155–161
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.18/
Cite (ACL):
Mark Simmons. 2025. Data augmentation for low-resource bilingual ASR from Tira linguistic elicitation using Whisper. In Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 155–161, Honolulu, Hawaii, USA. Association for Computational Linguistics.
Cite (Informal):
Data augmentation for low-resource bilingual ASR from Tira linguistic elicitation using Whisper (Simmons, ComputEL 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.18.pdf