Faithful Transcription: Leveraging Bible Recordings to Improve ASR for Endangered Languages

Eric Le Ferrand, Cian Mohamed Bashar Hauser, Joshua Hartshorne, Emily Prud’hommeaux


Abstract
While automatic speech recognition (ASR) now achieves human-level accuracy for a dozen or so languages, the majority of the world’s languages lack the resources needed to train robust ASR models. For many of these languages, the largest available source of transcribed speech data consists of recordings of the Bible. Bible recordings are appealingly large and well-structured resources, but they have notable limitations: the vocabulary and style are constrained, and the recordings are typically produced by a single speaker in a studio. These factors raise an important question: to what extent are Bible recordings useful for developing ASR models to transcribe contemporary naturalistic speech, the goal of most ASR applications? In this paper, we use Bible recordings alongside contemporary speech recordings to train ASR models in a selection of under-resourced and endangered languages. We find that models trained solely on Bible data yield shockingly weak performance when tested on contemporary everyday speech, even when compared to models trained on other (non-Bible) out-of-domain data. We identify one way of effectively leveraging Bible data in the ASR training pipeline via a two-stage training regime. Our results highlight the need to re-assess reported results relying exclusively on Bible data and to use Bible data carefully and judiciously.
Anthology ID:
2025.ijcnlp-short.28
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:
IJCNLP | AACL
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Pages:
333–342
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-short.28/
Cite (ACL):
Eric Le Ferrand, Cian Mohamed Bashar Hauser, Joshua Hartshorne, and Emily Prud’hommeaux. 2025. Faithful Transcription: Leveraging Bible Recordings to Improve ASR for Endangered Languages. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 333–342, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
Faithful Transcription: Leveraging Bible Recordings to Improve ASR for Endangered Languages (Le Ferrand et al., IJCNLP-AACL 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-short.28.pdf