BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews

Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan


Abstract
The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While sentiment analysis has been widely explored in many popular languages, relatively little attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and establish baselines with a range of machine learning models, including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our code and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
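The abstract describes the unigram-based error analysis only at a high level. As a rough illustration of the general idea, and not the authors' actual procedure, the following minimal sketch counts the most frequent unigrams in misclassified reviews, grouped by (gold label, predicted label). The function name `misclassified_unigrams`, the record layout, the whitespace tokenization, and the toy reviews are all assumptions made for this example; none of them come from the BanglaBook dataset or repository.

```python
from collections import Counter, defaultdict


def misclassified_unigrams(samples, top_k=3):
    """Return the top_k unigrams per (gold, predicted) pair for wrong predictions.

    `samples` is an iterable of (text, gold_label, predicted_label) tuples.
    This record layout is hypothetical and chosen only for the sketch.
    """
    counts = defaultdict(Counter)
    for text, gold, predicted in samples:
        if gold != predicted:
            # Whitespace tokenization is a simplification; a real pipeline
            # would likely use a Bangla-aware tokenizer.
            counts[(gold, predicted)].update(text.split())
    return {pair: counter.most_common(top_k) for pair, counter in counts.items()}


# Hypothetical toy records, not taken from the dataset.
toy_samples = [
    ("ভালো বই কিন্তু শেষটা দুর্বল", "neutral", "positive"),
    ("ভালো গল্প তবে ধীর", "neutral", "positive"),
    ("খারাপ অনুবাদ", "negative", "negative"),
]

result = misclassified_unigrams(toy_samples, top_k=1)
print(result)
```

In this toy input, the positive word "ভালো" ("good") appears in both neutral reviews that were predicted as positive, so it surfaces as the top unigram for that error pair. A strongly polar word pulling a mixed review toward the wrong class is the kind of pattern such an analysis is meant to expose.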
Anthology ID:
2023.findings-acl.80
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1237–1247
URL:
https://aclanthology.org/2023.findings-acl.80
DOI:
10.18653/v1/2023.findings-acl.80
Cite (ACL):
Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud, and Md Kamrul Hasan. 2023. BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1237–1247, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews (Kabir et al., Findings 2023)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2023.findings-acl.80.pdf
Video:
https://preview.aclanthology.org/emnlp-22-attachments/2023.findings-acl.80.mp4