Labib Imam Chowdhury
2025
BanNERD: A Benchmark Dataset and Context-Driven Approach for Bangla Named Entity Recognition
Md. Motahar Mahtab | Faisal Ahamed Khan | Md. Ekramul Islam | Md. Shahad Mahmud Chowdhury | Labib Imam Chowdhury | Sadia Afrin | Hazrat Ali | Mohammad Mamun Or Rashid | Nabeel Mohammed | Mohammad Ruhul Amin
Findings of the Association for Computational Linguistics: NAACL 2025
In this study, we introduce BanNERD, the most extensive human-annotated and validated Bangla Named Entity Recognition dataset to date, comprising over 85,000 sentences. BanNERD is curated from a diverse array of sources spanning over 29 domains, and so covers a broad range of general contexts. To ensure the dataset’s quality, expert linguists developed a detailed annotation guideline tailored to the Bangla language. All annotations were then checked by a team of validators, with final labels determined by majority voting, yielding a high inter-annotator agreement (IAA) score of 0.88. In a cross-dataset evaluation, models trained on BanNERD consistently outperformed those trained on four existing Bangla NER datasets. Additionally, we propose BanNERCEM (Bangla NER Context-Ensemble Method), which outperforms existing approaches on Bangla NER datasets and performs competitively on English datasets using lightweight Bangla pretrained LLMs. Unlike previous concatenation-based approaches, our method passes each context to the model separately, which makes better use of context and achieves the highest average macro F1 score of 81.85% across 10 NER classes. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanNERD to contribute to the further advancement of Bangla NLP.
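The abstract states that final labels were determined by majority voting among validators but gives no further detail. The following is a minimal sketch of how per-token voting could work, not the authors' actual procedure. The function names `majority_label` and `resolve_sentence`, the BIO-style tag strings, and the choice to return `None` on a tie (to flag the token for adjudication) are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label with the most votes, or None if there is no clear winner.

    Assumption: a tie (or an empty vote list) is left unresolved so that a
    human can adjudicate it; the abstract does not say how ties were handled.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    # A tie exists when the top two labels have the same vote count.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def resolve_sentence(votes_per_token: Sequence[Sequence[str]]) -> List[Optional[str]]:
    """Apply majority voting independently to every token in a sentence."""
    return [majority_label(votes) for votes in votes_per_token]


if __name__ == "__main__":
    # Hypothetical votes from three validators for a three-token sentence.
    sentence_votes = [
        ["B-PER", "B-PER", "O"],
        ["I-PER", "I-PER", "I-PER"],
        ["O", "B-LOC", "B-ORG"],  # three-way disagreement, left unresolved
    ]
    print(resolve_sentence(sentence_votes))
```

Voting per token keeps the sketch simple; a real pipeline might instead vote over whole entity spans so that the resolved tag sequence stays internally consistent.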