SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts

Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Islam, Mohammad Ruhul Amin


Abstract
In this paper, we propose an annotated sentiment analysis dataset of informally written Bangla texts. The dataset comprises public comments on news and videos collected from social media, covering 13 different domains, including politics, education, and agriculture. Each comment is labeled with one of three polarity labels: positive, negative, or neutral. One significant characteristic of the dataset is that each comment is noisy in its mix of dialects and its grammatical errors. Our experiments to develop a benchmark classification system show that hand-crafted lexical features outperform neural network and pretrained language models. We have made the dataset and accompanying models presented in this paper publicly available at https://git.io/JuuNB.
Anthology ID:
2021.findings-emnlp.278
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3265–3271
URL:
https://aclanthology.org/2021.findings-emnlp.278
DOI:
10.18653/v1/2021.findings-emnlp.278
Cite (ACL):
Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Islam, and Mohammad Ruhul Amin. 2021. SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3265–3271, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts (Islam et al., Findings 2021)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2021.findings-emnlp.278.pdf
Video:
https://preview.aclanthology.org/paclic-22-ingestion/2021.findings-emnlp.278.mp4
Code:
KhondokerIslam/SentNoB
Data:
SentNoB