BanglaByT5: Byte-Level Modelling for Bangla

Pramit Bhattacharyya, Arnab Bhattacharya


Abstract
Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLMs use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google's ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight BanglaByT5's potential as a lightweight yet powerful tool for Bangla NLP, particularly in resource-constrained or scalable environments. BanglaByT5 is publicly available for download from https://huggingface.co/Vacaspati/BanglaByT5.
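Since the checkpoint is published on the Hugging Face Hub, it can presumably be loaded with the standard transformers seq2seq classes. The following is a minimal sketch, not taken from the paper: the generation settings and the example Bangla sentence are illustrative assumptions.

# Minimal sketch: loading BanglaByT5 from the Hugging Face Hub with the
# generic transformers seq2seq API. Generation settings and the input
# sentence below are assumptions for illustration only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Vacaspati/BanglaByT5"
tokenizer = AutoTokenizer.from_pretrained(model_id)        # byte-level ByT5-style tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)    # small ByT5 encoder-decoder

# ByT5 operates directly on UTF-8 bytes, so no Bangla-specific vocabulary is required.
text = "আমি বাংলায় গান গাই।"  # "I sing in Bangla." (illustrative input)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))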
Anthology ID:
2025.findings-emnlp.297
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5551–5560
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.297/
DOI:
10.18653/v1/2025.findings-emnlp.297
Cite (ACL):
Pramit Bhattacharyya and Arnab Bhattacharya. 2025. BanglaByT5: Byte-Level Modelling for Bangla. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5551–5560, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
BanglaByT5: Byte-Level Modelling for Bangla (Bhattacharyya & Bhattacharya, Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.297.pdf
Checklist:
 2025.findings-emnlp.297.checklist.pdf