Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

Srihari Bandarupalli, Bhavana Akkiraju, Sri Charan D, Vamshi Raghu Simha Narasinga, Anil Vuppala


Abstract
Automatic speech recognition (ASR) for low-resource languages remains fundamentally constrained by the scarcity of labeled data and the computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically aware tokenization to develop a 300M-parameter model that achieves performance comparable to systems five times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are the more critical factors in low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.
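To make the core idea of the abstract concrete, below is a minimal sketch of what continual pretraining on unlabeled speech can look like in practice, using the Hugging Face transformers implementation of the wav2vec2 self-supervised objective. The checkpoint name (facebook/wav2vec2-xls-r-300m is one plausible ~300M-parameter starting point), the masking hyperparameters, and the learning rate are illustrative assumptions, not details taken from the paper.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Assumed base checkpoint; the paper's actual base model is not stated here.
checkpoint = "facebook/wav2vec2-xls-r-300m"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForPreTraining.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stand-in for a batch of unlabeled 16 kHz speech (e.g., crawled Persian,
# Arabic, or Urdu audio); no transcripts are needed for this objective.
waveforms = [torch.randn(16000 * 5).numpy() for _ in range(2)]
inputs = feature_extractor(waveforms, sampling_rate=16000, return_tensors="pt")

batch_size, raw_len = inputs.input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Mask spans of latent frames and sample distractors for the contrastive loss.
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.065, mask_length=10
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices, dtype=torch.long),
)
outputs.loss.backward()  # contrastive + codebook-diversity loss on unlabeled audio
optimizer.step()
optimizer.zero_grad()

Because the loss is purely self-supervised, a pipeline like this can consume crawled speech directly, which is what makes the 3,000-hour unlabeled corpus usable without transcription; a supervised fine-tuning stage on the available labeled data would follow.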
Anthology ID:
2025.findings-ijcnlp.99
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venue:
Findings
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Pages:
1611–1617
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.99/
Cite (ACL):
Srihari Bandarupalli, Bhavana Akkiraju, Sri Charan D, Vamshi Raghu Simha Narasinga, and Anil Vuppala. 2025. Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 1611–1617, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data (Bandarupalli et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.99.pdf