BOIGENRE: A Large-Scale Bangla Dataset for Genre Classification from Book Summaries

Rafi Hassan Chowdhury, Rahanuma Ryaan Ferdous


Abstract
The classification of literary genres plays a vital role in digital humanities and natural language processing (NLP), supporting tasks such as content organization, recommendation, and linguistic analysis. However, progress for the Bangla language remains limited due to the lack of large, structured datasets. To address this gap, we present BOIGENRE, the first large-scale dataset for Bangla book genre classification, built from publicly available summaries. The dataset contains 25,951 unique samples across 16 genres, showcasing diversity in narrative style, vocabulary, and linguistic expression. We provide statistical insights into text length, lexical richness, and cross-genre vocabulary overlap. To establish benchmarks, we evaluate traditional machine learning, neural, and transformer-based models. Results show that while unigram-based classifiers perform reasonably, transformer models, particularly BanglaBERT, achieve the highest F1-score of 69.62%. By releasing BOIGENRE and baseline results, we offer a valuable resource and foundation for future research in Bangla text classification and low-resource NLP.
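The abstract mentions lexical richness and cross-genre vocabulary overlap without saying how they are measured. As a minimal sketch, assuming lexical richness is taken as a type-token ratio and overlap as Jaccard similarity between per-genre vocabularies (both are illustrative choices, not confirmed as the authors' measures), such statistics could be computed as follows. The function names, the whitespace tokenization, and the transliterated toy samples are hypothetical and are not drawn from BOIGENRE.

```python
from collections import defaultdict


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def jaccard_overlap(vocab_a, vocab_b):
    """Shared vocabulary divided by combined vocabulary; 0.0 if both are empty."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def genre_vocabularies(samples):
    """Collect the set of tokens seen in each genre.

    `samples` is an iterable of (genre, summary) pairs. Whitespace
    tokenization is an assumption made for illustration only.
    """
    vocab = defaultdict(set)
    for genre, summary in samples:
        vocab[genre].update(summary.split())
    return dict(vocab)


# Toy placeholder data (transliterated words), not taken from the dataset.
toy_samples = [
    ("genre_a", "raat nodi nouka raat"),
    ("genre_b", "nodi shohor manush"),
]

vocabs = genre_vocabularies(toy_samples)
ttr_a = type_token_ratio(toy_samples[0][1].split())
overlap = jaccard_overlap(vocabs["genre_a"], vocabs["genre_b"])
```

On real data, a Bangla-aware tokenizer would likely give more meaningful counts than whitespace splitting, and type-token ratio is sensitive to text length, so comparisons across genres with very different summary lengths should be read with care.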
Anthology ID:
2025.banglalp-1.20
Volume:
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
Venues:
BanglaLP | WS
Publisher:
Association for Computational Linguistics
Pages:
249–258
URL:
https://preview.aclanthology.org/old-master/2025.banglalp-1.20/
Cite (ACL):
Rafi Hassan Chowdhury and Rahanuma Ryaan Ferdous. 2025. BOIGENRE: A Large-Scale Bangla Dataset for Genre Classification from Book Summaries. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 249–258, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
BOIGENRE: A Large-Scale Bangla Dataset for Genre Classification from Book Summaries (Chowdhury & Ferdous, BanglaLP 2025)
PDF:
https://preview.aclanthology.org/old-master/2025.banglalp-1.20.pdf