BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation

Md. Tofael Ahmed Bhuiyan, Md. Abdur Rahman, Abdul Kadar Muhammad Masum


Abstract
While machine translation has made significant strides for high-resource languages, many regional languages and their dialects, such as the Bangla variants Chittagong and Sylhet, remain underserved. Existing resources are often insufficient for robust sentence-level evaluation, and they overlook romanization, the widespread practice of typing native languages in the Latin script in digital communication. To address these gaps, we introduce BhasaBodh, a benchmark for Bangla dialectal machine translation. We construct and release a sentence-level parallel dataset for the Chittagong and Sylhet dialects aligned with Standard Bangla and English, create a novel romanized version of the dialectal data to support evaluation in realistic multi-script scenarios, and provide the first comprehensive performance baselines by fine-tuning two multilingual models, NLLB-200 and mBART-50, on seven distinct translation tasks. Our experiments show that mBART-50 outperforms NLLB-200 on most dialectal and romanized tasks, reaching a BLEU score of 87.44 on the Romanized-to-Standard Bangla normalization task. However, complex cross-lingual and cross-script translation remains a significant challenge. BhasaBodh lays the groundwork for future research in low-resource dialectal NLP and offers a resource for developing more inclusive and practical translation systems.
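For readers unfamiliar with the BLEU metric reported above, the following is a minimal sketch of how a corpus-level BLEU score on a 0-100 scale can be computed. It is an illustration only: the abstract does not state which BLEU implementation or tokenization was used, so the whitespace tokenization, the single reference per hypothesis, the absence of smoothing, and the function names `ngrams` and `corpus_bleu` are all assumptions, and the English sentences are toy data rather than samples from the BhasaBodh dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis.

    Assumptions of this sketch: whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as many
            # times as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the 1..max_n modified precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["the cat sat on the mat"]   # toy data, not from the dataset
    hypothesis = ["the cat sat on a mat"]
    print(round(corpus_bleu(hypothesis, reference), 2))
```

An exact match scores 100 and a hypothesis sharing no 4-gram with its reference scores 0, so a value such as 87.44 indicates very high n-gram overlap with the reference translations. Scores from this sketch are not directly comparable to published numbers, which depend on the tokenizer and BLEU variant used.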
Anthology ID:
2025.banglalp-1.9
Volume:
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
Venues:
BanglaLP | WS
Publisher:
Association for Computational Linguistics
Pages:
113–118
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.9/
Cite (ACL):
Md. Tofael Ahmed Bhuiyan, Md. Abdur Rahman, and Abdul Kadar Muhammad Masum. 2025. BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 113–118, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation (Bhuiyan et al., BanglaLP 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.9.pdf