Devilal Choudhary
2025
Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages
Ashwin Sankar | Sparsh Jain | Nikhil Narasimhan | Devilal Choudhary | Dhairya Suman | Mohammed Safi Ur Rahman Khan | Anoop Kunchukuttan | Mitesh M Khapra | Raj Dabre
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover only a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a state-of-the-art speech translation model for Indian languages that performs better than existing models. Our experiments demonstrate improvements in translation quality, setting a new standard for Indian language speech translation. We will release all code, data, and model weights as open source under permissive licenses to promote accessibility and collaboration.
Co-authors
- Raj Dabre 1
- Sparsh Jain 1
- Mohammed Safi Ur Rahman Khan 1
- Mitesh M. Khapra 1
- Anoop Kunchukuttan 1
Venues
- ACL 1