M-BAD: A Multilabel Dataset for Detecting Aggressive Texts and Their Targets

Omar Sharif; Eftekhar Hossain; Mohammed Moshiul Hoque

doi:10.18653/v1/2022.constraint-1.9

Abstract

Recently, the detection and categorization of undesired (e.g., aggressive, abusive, offensive, hateful) content on online platforms has attracted the attention of researchers because of its detrimental impact on society. Several attempts have been made to mitigate the usage and propagation of such content. However, most past studies were conducted primarily for English, while low-resource languages like Bengali remained out of focus. Therefore, to facilitate research in this area, this paper introduces a novel multilabel Bengali dataset (named M-BAD) containing 15,650 texts for detecting aggressive texts and their targets. Each text in M-BAD went through rigorous two-level annotation. At the primary level, each text is labelled as either aggressive or non-aggressive. At the secondary level, the aggressive texts are further annotated with five fine-grained target classes: religion, politics, verbal, gender and race. Baseline experiments are carried out with different machine learning (ML), deep learning (DL) and transformer models, where Bangla-BERT achieved the highest weighted f₁-score in both the detection (0.92) and target identification (0.83) tasks. Error analysis of the models reveals the difficulty of identifying context-dependent aggression, and this work argues that further research is required to address these issues.
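To make the two-level scheme and the reported metric concrete, here is a minimal sketch in plain Python. The function names, the multi-hot layout, and the toy labels are assumptions made for illustration only; they are not taken from the M-BAD release or the authors' code. The weighted F1 shown is the standard support-weighted average of per-class F1, applied here to the single-label primary task; for the multilabel target task, the same per-class computation would typically be applied to each target column separately.

```python
from collections import Counter

# Target classes named in the abstract (secondary annotation level).
TARGETS = ["religion", "politics", "verbal", "gender", "race"]


def encode_targets(targets):
    """Turn a set of target names into a 5-dimensional multi-hot vector."""
    return [1 if t in targets else 0 for t in TARGETS]


def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total
               for c, n in support.items())


# Hypothetical toy labels for the primary (aggressive vs. non-aggressive) level.
gold = ["aggressive", "aggressive", "non-aggressive", "non-aggressive"]
pred = ["aggressive", "non-aggressive", "non-aggressive", "non-aggressive"]
score = weighted_f1(gold, pred)

# A hypothetical aggressive text aimed at two targets at once (multilabel).
vector = encode_targets({"politics", "gender"})
```

Weighting by support keeps a rare class from dominating the average, which matters when the class sizes are unequal.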

Anthology ID: 2022.constraint-1.9
Volume: Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Tanmoy Chakraborty, Md. Shad Akhtar, Kai Shu, H. Russell Bernard, Maria Liakata, Preslav Nakov, Aseem Srivastava
Venue: CONSTRAINT
Publisher: Association for Computational Linguistics
Pages: 75–85
URL: https://preview.aclanthology.org/ingest-emnlp/2022.constraint-1.9/
DOI: 10.18653/v1/2022.constraint-1.9
Cite (ACL): Omar Sharif, Eftekhar Hossain, and Mohammed Moshiul Hoque. 2022. M-BAD: A Multilabel Dataset for Detecting Aggressive Texts and Their Targets. In Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations, pages 75–85, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): M-BAD: A Multilabel Dataset for Detecting Aggressive Texts and Their Targets (Sharif et al., CONSTRAINT 2022)
PDF: https://preview.aclanthology.org/ingest-emnlp/2022.constraint-1.9.pdf
Video: https://preview.aclanthology.org/ingest-emnlp/2022.constraint-1.9.mp4
