Abusive content detection in transliterated Bengali-English social media corpus

Salim Sazzed


Abstract
Abusive text detection in low-resource languages such as Bengali is a challenging task due to the inadequacy of resources and tools. The ubiquity of transliterated Bengali comments on social media makes the task even more involved, as monolingual approaches cannot capture them. Unfortunately, no transliterated Bengali corpus is publicly available yet for abusive content analysis. Therefore, in this paper, we introduce an annotated corpus of 3,000 transliterated Bengali comments categorized into two classes, abusive and non-abusive, with 1,500 comments in each. For baseline evaluations, we employ several supervised machine learning (ML) and deep learning-based classifiers. We find that the support vector machine (SVM) is the most effective at identifying abusive content. We make the annotated corpus freely available to researchers to aid abusive content detection in Bengali social media data.
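To make the SVM baseline mentioned in the abstract more concrete, here is a minimal sketch of a linear SVM for binary abusive/non-abusive classification of transliterated text. It is not the author's pipeline: the abstract does not state the features, the SVM implementation, or the hyperparameters. The bag-of-words features, the subgradient training loop, the function names (`featurize`, `train_linear_svm`, `predict`), the hyperparameter values, and the four toy comments are all assumptions made for illustration. The toy comments are invented and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def featurize(text):
    """Lowercase the text and count whitespace-separated tokens (assumed features)."""
    return Counter(text.lower().split())


def train_linear_svm(samples, epochs=20, lr=0.1, lam=0.01):
    """Train a linear SVM by subgradient descent on L2-regularized hinge loss.

    samples: list of (text, label) pairs, label 1 = abusive, 0 = non-abusive.
    Returns (weights, bias).
    """
    w = defaultdict(float)
    b = 0.0
    # Map labels {0, 1} to {-1, +1}, as the hinge loss expects.
    data = [(featurize(t), 1 if y == 1 else -1) for t, y in samples]
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(w[k] * v for k, v in x.items()) + b)
            # L2 regularization: shrink every weight a little at each step.
            for k in w:
                w[k] *= 1.0 - lr * lam
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for k, v in x.items():
                    w[k] += lr * y * v
                b += lr * y
    return w, b


def predict(model, text):
    """Return 1 (abusive) if the decision score is positive, else 0."""
    w, b = model
    score = sum(w.get(k, 0.0) * v for k, v in featurize(text).items()) + b
    return 1 if score > 0 else 0


# Invented toy comments for illustration only; NOT from the released corpus.
toy = [
    ("tui ekta boka", 1),
    ("tor matha kharap", 1),
    ("khub sundor gaan", 0),
    ("apnake onek dhonnobad", 0),
]

model = train_linear_svm(toy)
print([predict(model, text) for text, _ in toy])
```

On real data, one would load the released corpus, hold out a test split, and likely swap the whitespace tokens for features that tolerate spelling variation in transliterated text, such as character n-grams; those choices are likewise assumptions rather than details given in the abstract.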
Anthology ID: 2021.calcs-1.16
Volume: Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Month: June
Year: 2021
Address: Online
Editors: Thamar Solorio, Shuguang Chen, Alan W. Black, Mona Diab, Sunayana Sitaram, Victor Soto, Emre Yilmaz, Anirudh Srinivasan
Venue: CALCS
Publisher: Association for Computational Linguistics
Pages: 125–130
URL: https://aclanthology.org/2021.calcs-1.16
DOI: 10.18653/v1/2021.calcs-1.16
Cite (ACL): Salim Sazzed. 2021. Abusive content detection in transliterated Bengali-English social media corpus. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 125–130, Online. Association for Computational Linguistics.
Cite (Informal): Abusive content detection in transliterated Bengali-English social media corpus (Sazzed, CALCS 2021)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2021.calcs-1.16.pdf
Code: sazzadcsedu/abusivecorpus