Mohaymen Ul Anam
2025
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
Tawsif Tashwar Dipto | Azmol Hossain | Rubayet Sabbir Faruque | Md. Rezuwan Hassan | Kanij Fatema | Tanmoy Shome | Ruwad Naswan | Md.Foriduzzaman Zihad | Mohaymen Ul Anam | Nazia Tasnim | Hasan Mahmud | Md Kamrul Hasan | Md. Mehedi Hasan Shawon | Farig Sadeque | Tahsin Reasat
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under resource-constrained conditions. The dataset and code developed for this project are publicly available.