Abstract
As the number of social media users grows rapidly, the volume of hateful content has increased markedly, making automatic hate speech detection systems a necessity. Researchers have recently focused on developing datasets for this purpose; however, the vast majority have been created for English, with only a few available for low-resource languages such as Roman Urdu. Furthermore, the few that are available contain a small number of samples for the hateful classes, and these samples lack variation in topic and content. As a result, deep learning models trained on such datasets perform poorly when deployed in the real world. Collecting and annotating more data to improve performance can be very costly and time-consuming. Data augmentation techniques therefore need to be explored to exploit already available datasets and improve model generalizability. In this paper, we explore different data augmentation techniques for improving hate speech detection in Roman Urdu. We evaluate these augmentation techniques on two datasets. We are able to improve performance on the primary comparison metrics (F1 and Macro F1) as well as on recall, which is important for human-in-the-loop AI systems.
- Anthology ID:
- 2022.lrec-1.481
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4523–4531
- URL:
- https://aclanthology.org/2022.lrec-1.481
- Cite (ACL):
- Ubaid Azam, Hammad Rizwan, and Asim Karim. 2022. Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4523–4531, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu (Azam et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/starsem-semeval-split/2022.lrec-1.481.pdf
- Data
- Hate Speech and Offensive Language