Abstract
Abusive text detection in low-resource languages such as Bengali is a challenging task due to the inadequacy of resources and tools. The ubiquity of transliterated Bengali comments in social media makes the task even more involved, as monolingual approaches cannot capture them. Unfortunately, no transliterated Bengali corpus is publicly available yet for abusive content analysis. Therefore, in this paper, we introduce an annotated corpus of 3000 transliterated Bengali comments categorized into two classes, abusive and non-abusive, with 1500 comments in each. For baseline evaluations, we employ several supervised machine learning (ML) and deep learning-based classifiers. We find that the support vector machine (SVM) shows the highest efficacy for identifying abusive content. We make the annotated corpus freely available to researchers to aid abusive content detection in Bengali social media data.
- Anthology ID:
- 2021.calcs-1.16
- Volume:
- Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Editors:
- Thamar Solorio, Shuguang Chen, Alan W. Black, Mona Diab, Sunayana Sitaram, Victor Soto, Emre Yilmaz, Anirudh Srinivasan
- Venue:
- CALCS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 125–130
- URL:
- https://aclanthology.org/2021.calcs-1.16
- DOI:
- 10.18653/v1/2021.calcs-1.16
- Cite (ACL):
- Salim Sazzed. 2021. Abusive content detection in transliterated Bengali-English social media corpus. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 125–130, Online. Association for Computational Linguistics.
- Cite (Informal):
- Abusive content detection in transliterated Bengali-English social media corpus (Sazzed, CALCS 2021)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2021.calcs-1.16.pdf
- Code:
- sazzadcsedu/abusivecorpus