A Study of the Class Imbalance Problem in Abusive Language Detection

Yaqi Zhang, Viktor Hangya, Alexander Fraser


Abstract
Abusive language detection has drawn increasing interest in recent years. However, a less systematically explored obstacle is label imbalance, i.e., the amount of abusive data is much lower than that of non-abusive data, leading to performance issues. The aim of this work is to conduct a comprehensive comparative study of popular methods for addressing the class imbalance issue. We explore 10 well-known approaches on 8 datasets with distinct characteristics: binary or multi-class, moderately or largely imbalanced, focusing on various types of abuse, etc. Additionally, we propose two novel methods specialized for abuse detection: AbusiveLexiconAug and ExternalDataAug, which enrich the training data using abusive lexicons and external abusive datasets, respectively. We conclude that: 1) our AbusiveLexiconAug approach, random oversampling, and focal loss are the most versatile methods on various datasets; 2) focal loss tends to yield peak model performance; 3) oversampling and focal loss provide promising results for binary datasets and small multi-class sets, while undersampling and weighted cross-entropy are more suitable for large multi-class sets; 4) most methods are sensitive to hyperparameters, yet our suggested choice of hyperparameters provides a good starting point.
Anthology ID:
2024.woah-1.4
Volume:
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yi-Ling Chung, Zeerak Talat, Debora Nozza, Flor Miriam Plaza-del-Arco, Paul Röttger, Aida Mostafazadeh Davani, Agostina Calabrese
Venues:
WOAH | WS
Publisher:
Association for Computational Linguistics
Pages:
38–51
URL:
https://aclanthology.org/2024.woah-1.4
Cite (ACL):
Yaqi Zhang, Viktor Hangya, and Alexander Fraser. 2024. A Study of the Class Imbalance Problem in Abusive Language Detection. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024), pages 38–51, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
A Study of the Class Imbalance Problem in Abusive Language Detection (Zhang et al., WOAH-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.woah-1.4.pdf