Investigating Sampling Bias in Abusive Language Detection

Dante Razo, Sandra Kübler


Abstract
Abusive language detection is becoming increasingly important, but we still understand little about the biases in our datasets for abusive language detection, and how these biases affect the quality of abusive language detection. In the work reported here, we reproduce the investigation of Wiegand et al. (2019) to determine differences between sampling strategies. They compared boosted random sampling, where abusive posts are upsampled, and biased topic sampling, which focuses on topics known to elicit abusive language. Instead of comparing individual datasets created using these sampling strategies, we apply the sampling strategies to a single, large dataset, thus eliminating the textual source of the dataset as a potential confounding factor. We show that differences in the textual source can have a greater effect than the chosen sampling strategy.
Anthology ID:
2020.alw-1.9
Volume:
Proceedings of the Fourth Workshop on Online Abuse and Harms
Month:
November
Year:
2020
Address:
Online
Venue:
ALW
Publisher:
Association for Computational Linguistics
Pages:
70–78
URL:
https://aclanthology.org/2020.alw-1.9
DOI:
10.18653/v1/2020.alw-1.9
Cite (ACL):
Dante Razo and Sandra Kübler. 2020. Investigating Sampling Bias in Abusive Language Detection. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 70–78, Online. Association for Computational Linguistics.
Cite (Informal):
Investigating Sampling Bias in Abusive Language Detection (Razo & Kübler, ALW 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.alw-1.9.pdf
Optional supplementary material:
 2020.alw-1.9.OptionalSupplementaryMaterial.zip
Video:
 https://slideslive.com/38939527