ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language
Md. Abdur Rahman, Md. Tofael Ahmed Bhuiyan, Abdul Kadar Muhammad Masum
Abstract
The advancement of NLP technologies for low-resource and endangered languages is critically hindered by the scarcity of high-quality, parallel corpora. This is particularly true for languages like Chakma, which also faces the challenge of prevalent non-standard, romanized script usage in digital communication. To address this, we introduce ChakmaBridge, the first five-way parallel corpus for Chakma, containing 807 sentences aligned across English, Standard Bangla, Bengali-script Chakma, Romanized Bangla, and Romanized Chakma. Our dataset is created by augmenting the MELD corpus with LLM-generated romanizations that are rigorously validated by native speakers. We establish robust machine translation baselines across six diverse language and script pairs. Our experiments reveal that a multilingual training approach, combining English and Bangla as source languages, yields a dramatic performance increase, achieving a BLEU score of 0.5228 for Chakma translation, a 124% relative improvement over the best bilingual model. We release ChakmaBridge to facilitate research in low-resource MT and aid in the digital preservation of this endangered language.
- Anthology ID:
- 2025.banglalp-1.21
- Volume:
- Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
- Venues:
- BanglaLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 259–265
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.21/
- Cite (ACL):
- Md. Abdur Rahman, Md. Tofael Ahmed Bhuiyan, and Abdul Kadar Muhammad Masum. 2025. ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 259–265, Mumbai, India. Association for Computational Linguistics.
- Cite (Informal):
- ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language (Rahman et al., BanglaLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.21.pdf