From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Tianduo Wang; Lu Xu; Wei Lu; Shanbo Cheng

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.

Anthology ID: 2025.emnlp-main.629
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 12461–12475
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.629/
Cite (ACL): Tianduo Wang, Lu Xu, Wei Lu, and Shanbo Cheng. 2025. From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12461–12475, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition (Wang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.629.pdf
Checklist: 2025.emnlp-main.629.checklist.pdf
