Kazi Samin Mubasshir

2023

SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis
Imtiaz Karim | Kazi Samin Mubasshir | Mirza Masfiqur Rahman | Elisa Bertino
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

2022

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed ‘Bangla2B+’) by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

Co-authors

M. Sohel Rahman 1

Rifat Shahriyar 1

Imtiaz Karim 1

Mirza Masfiqur Rahman 1

Elisa Bertino 1

Venues

Findings 2