BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation
Md. Tofael Ahmed Bhuiyan, Md. Abdur Rahman, Abdul Kadar Muhammad Masum
Abstract
While machine translation has made significant strides for high-resource languages, many regional languages and their dialects, such as the Bangla variants Chittagong and Sylhet, remain underserved. Existing resources are often insufficient for robust sentence-level evaluation and overlook romanization, the widespread practice of typing native languages in the Latin script in digital communication. To address these gaps, we introduce BhasaBodh, a comprehensive benchmark for Bangla dialectal machine translation. We construct and release a sentence-level parallel dataset for the Chittagong and Sylhet dialects aligned with Standard Bangla and English, create a novel romanized version of the dialectal data to facilitate evaluation in realistic multi-script scenarios, and provide the first comprehensive performance baselines by fine-tuning two powerful multilingual models, NLLB-200 and mBART-50, on seven distinct translation tasks. Our experiments reveal that mBART-50 outperforms NLLB-200 on most dialectal and romanized tasks, achieving a BLEU score as high as 87.44 on the Romanized-to-Standard Bangla normalization task. However, complex cross-lingual and cross-script translation remains a significant challenge. BhasaBodh lays the groundwork for future research in low-resource dialectal NLP, offering a valuable resource for developing more inclusive and practical translation systems.
- Anthology ID:
- 2025.banglalp-1.9
- Volume:
- Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
- Venues:
- BanglaLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 113–118
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.9/
- Cite (ACL):
- Md. Tofael Ahmed Bhuiyan, Md. Abdur Rahman, and Abdul Kadar Muhammad Masum. 2025. BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 113–118, Mumbai, India. Association for Computational Linguistics.
- Cite (Informal):
- BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation (Bhuiyan et al., BanglaLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.9.pdf