That doesn’t sound right: Evaluating speech transcription quality in field linguistics corpora
Eric Le Ferrand, Bo Jiang, Joshua Hartshorne, Emily Prud’hommeaux
Abstract
Incorporating automatic speech recognition (ASR) into field linguistics workflows for language documentation has become increasingly common. While ASR performance has seen improvements in low-resource settings, obstacles remain when training models on data collected by documentary linguists. One notable challenge lies in the way that this data is curated. ASR datasets built from spontaneous speech are typically recorded in consistent settings and transcribed by native speakers following a set of well-designed guidelines. In contrast, field linguists collect data in whatever format it is delivered by their language consultants and transcribe it as best they can given their language skills and the quality of the recording. This approach to data curation, while valuable for linguistic research, does not always align with the standards required for training robust ASR models. In this paper, we explore methods for identifying speech transcriptions in fieldwork data that may be unsuitable for training ASR models. We focus on two complementary automated measures of transcription quality that can be used to identify transcripts with characteristics that are common in field data but could be detrimental to ASR training. We show that one of the metrics is highly effective at retrieving these types of transcriptions. Additionally, we find that filtering datasets using this metric of transcription quality reduces WER both in controlled experiments using simulated fieldwork with artificially corrupted data and in real fieldwork corpora.
- Anthology ID:
- 2025.acl-short.49
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 627–635
- URL:
- https://preview.aclanthology.org/landing_page/2025.acl-short.49/
- Cite (ACL):
- Eric Le Ferrand, Bo Jiang, Joshua Hartshorne, and Emily Prud’hommeaux. 2025. That doesn’t sound right: Evaluating speech transcription quality in field linguistics corpora. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 627–635, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- That doesn’t sound right: Evaluating speech transcription quality in field linguistics corpora (Le Ferrand et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.acl-short.49.pdf