Large-Scale Hate Speech Detection with Cross-Domain Transfer

Cagri Toraman; Furkan Şahinuç; Eyup Yilmaz

Abstract

The performance of hate speech detection models relies on the datasets on which the models are trained. Existing datasets are mostly prepared with a limited number of instances or hate domains that define hate topics. This hinders large-scale analysis and transfer learning with respect to hate domains. In this study, we construct large-scale tweet datasets for hate speech detection in English and a low-resource language, Turkish, each consisting of 100k human-labeled tweets. Our datasets are designed to have an equal number of tweets distributed over five domains. The experimental results, supported by statistical tests, show that Transformer-based language models outperform conventional bag-of-words and neural models by at least 5% in English and 10% in Turkish for large-scale hate speech detection. The performance also scales across training sizes: 98% of the performance in English, and 97% in Turkish, is recovered when 20% of the training instances are used. We further examine the generalization ability of cross-domain transfer among hate domains. We show that, on average, 96% of the performance of a target domain is recovered by other domains for English, and 92% for Turkish. Gender and religion generalize more successfully to other domains, while sports fails most.

Anthology ID: 2022.lrec-1.238
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 2215–2225
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2022.lrec-1.238/
Cite (ACL): Cagri Toraman, Furkan Şahinuç, and Eyup Yilmaz. 2022. Large-Scale Hate Speech Detection with Cross-Domain Transfer. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2215–2225, Marseille, France. European Language Resources Association.
Cite (Informal): Large-Scale Hate Speech Detection with Cross-Domain Transfer (Toraman et al., LREC 2022)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2022.lrec-1.238.pdf
Code: avaapm/hatespeech
Data: HateXplain