Richard Wright
2026
The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR
Siyu Liang | Nicolas Ballier | Gina-Anne Levow | Richard Wright
Proceedings of the Fifteenth Language Resources and Evaluation Conference
How much audio is needed to fully observe a multilingual ASR model’s learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper’s decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model’s sub-token space. Results show that the total number of discovered tokens remains largely independent of a language’s pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model’s hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank–frequency distributions reveal Zipf-like patterns better modeled by a Zipf–Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, these metrics show more favorable patterns for languages written in the Latin script than for those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.
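The saturation threshold described in the abstract can be illustrated with a minimal sketch (not the paper's code): given a log of `(timestamp, token_id)` decoding-candidate events, AST can be operationalized as the earliest time by which some fixed share of the final sub-token inventory has already been observed. The `coverage=0.95` cutoff and the data layout here are illustrative assumptions, not the paper's definition.

```python
# Illustrative sketch: estimating an "acoustic saturation time" (AST)
# from a log of decoding-candidate sub-tokens. Assumes a list of
# (timestamp_seconds, token_id) pairs; the 95% coverage cutoff is an
# arbitrary choice for illustration.

def acoustic_saturation_time(events, coverage=0.95):
    """Return the earliest timestamp by which `coverage` of all
    sub-tokens ever observed have already appeared."""
    events = sorted(events)                  # order events by time
    total = len({tok for _, tok in events})  # final inventory size
    target = coverage * total
    seen = set()
    for t, tok in events:
        seen.add(tok)
        if len(seen) >= target:
            return t
    return None

# Toy log in which new-token discovery slows over time.
log = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 2), (5, 5), (9, 1), (20, 6)]
print(acoustic_saturation_time(log))  # -> 20
```

On a real run, the exponential saturation reported in the paper would show up as `target` being reached well before the end of the audio, with the tail contributing few new tokens.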
2025
Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels
Siyu Liang | Nicolas Ballier | Gina-Anne Levow | Richard Wright
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher-resource languages benefit from a higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower-resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage, sometimes influenced by typology, in our PCA and t-SNE analyses. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
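The predictive-entropy metric mentioned in the abstract can be sketched as follows. This is an illustrative example, not the authors' released code: it computes the Shannon entropy of one decoding step from the log-probabilities of the logged candidate sub-tokens, renormalizing because only a top-k slice of the vocabulary is typically captured.

```python
import math

# Illustrative sketch: per-step predictive entropy from candidate
# log-probabilities (e.g., the top-k hypotheses logged during beam
# search). Lower entropy indicates a more confident decoding step.

def step_entropy(logprobs):
    """Shannon entropy (in nats) of a candidate distribution given
    log-probabilities; renormalizes the truncated top-k mass."""
    probs = [math.exp(lp) for lp in logprobs]
    z = sum(probs)                        # renormalize the top-k slice
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident step (mass concentrated on one token) vs. an uncertain one.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
uncertain = [math.log(1 / 3)] * 3
print(step_entropy(confident) < step_entropy(uncertain))  # -> True
```

Averaging this quantity over decoding steps gives one per-utterance confidence signal of the kind the paper compares across resource levels.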
2024
Probing Whisper Predictions for French, English and Persian Transcriptions
Nicolas Ballier | Léa Burin | Behnoosh Namdarzadeh | Sara Ng | Richard Wright | Jean-Baptiste Yunès
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)
2005
The Vocal Joystick: A Voice-Based Human-Computer Interface for Individuals with Motor Impairments
Jeff A. Bilmes | Xiao Li | Jonathan Malkin | Kelley Kilanski | Richard Wright | Katrin Kirchhoff | Amar Subramanya | Susumu Harada | James Landay | Patricia Dowden | Howard Chizeck
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing