Kathy Reid
2025
On the Tolerance of Repetition Before Performance Degradation in Kiswahili Automatic Speech Recognition
Kathleen Siminyu | Kathy Reid | Ryakitimboruby@gmail.com | Bmwasaru@gmail.com | Chenai@chenai.africa
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)
State-of-the-art end-to-end automatic speech recognition (ASR) models require large speech datasets for training. The Mozilla Common Voice project crowd-sources read speech to address this need. However, this approach often results in many audio utterances being recorded for each written sentence. Using Kiswahili speech data, this paper first explores how much audio repetition in utterances is permissible in a training set before model degradation occurs, then examines the extent to which audio augmentation techniques can be employed to increase the diversity of speech characteristics and improve accuracy. We find that repetition up to a ratio of 1 sentence to 8 audio recordings improves performance, but performance degrades at a ratio of 1:16. We also find small improvements from frequency mask, time mask and tempo augmentation. Our findings provide guidance on training set construction for ASR practitioners, particularly those working in under-served languages.
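As an illustrative sketch only (not the authors' implementation, and with placeholder parameter values and file path), the three augmentations mentioned in the abstract can be approximated with torchaudio: a tempo change is applied to the raw waveform via SoX effects, while frequency and time masking are applied to a spectrogram in the SpecAugment style.

# Hypothetical augmentation sketch; parameters and path are placeholders.
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")

# Tempo augmentation on the raw waveform (SoX effect); "1.1" = 10% faster.
augmented, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
    waveform, sample_rate, effects=[["tempo", "1.1"]]
)

# Frequency and time masking operate on a spectrogram representation.
spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(augmented)
spectrogram = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(spectrogram)
spectrogram = torchaudio.transforms.TimeMasking(time_mask_param=35)(spectrogram)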
2023
Right the docs: Characterising voice dataset documentation practices used in machine learning
Kathy Reid | Elizabeth T. Williams
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association
Voice-enabled technologies such as virtual assistants are quickly becoming ubiquitous. Their functionality relies on machine learning (ML) models that perform tasks such as automatic speech recognition (ASR). These models, in general, currently perform less accurately for some cohorts of speakers, across axes such as age, gender and accent; they are biased. ML models are trained from large datasets. ML Practitioners (MLPs) are interested in addressing bias across the ML lifecycle, and they often use dataset documentation here to understand dataset characteristics. However, there is a lack of research centred on voice dataset documentation. Our work makes an empirical contribution to this gap, identifying shortcomings in voice dataset documents (VDD), and arguing for actions to improve them. First, we undertake 13 interviews with MLPs who work with voice data, exploring how they use VDDs. We focus here on MLP roles and trade-offs made when working with VDDs. Drawing from the literature and from interview data, we create a rubric through which to analyse VDDs for nine voice datasets. Triangulating the two methods in our findings, we show that VDDs are inadequate for the needs of MLPs on several fronts. VDDs currently codify voice data characteristics in fragmented ways that make it difficult to compare and combine datasets, presenting a barrier to MLPs’ bias reduction efforts. We then seek to address these shortcomings and “right the docs” by proposing improvement actions aligned to our findings.