Abstract
The performance of hate speech detection models relies on the datasets on which the models are trained. Existing datasets are mostly prepared with a limited number of instances or hate domains that define hate topics. This hinders large-scale analysis and transfer learning with respect to hate domains. In this study, we construct large-scale tweet datasets for hate speech detection in English and a low-resource language, Turkish, each consisting of 100k human-labeled tweets. Our datasets are designed to have an equal number of tweets distributed over five domains. The experimental results, supported by statistical tests, show that Transformer-based language models outperform conventional bag-of-words and neural models by at least 5% in English and 10% in Turkish for large-scale hate speech detection. The performance is also scalable to different training sizes, such that 98% of the performance in English, and 97% in Turkish, is recovered when 20% of training instances are used. We further examine the generalization ability of cross-domain transfer among hate domains. We show that, on average, 96% of the performance of a target domain is recovered by other domains for English, and 92% for Turkish. Gender and religion generalize to other domains more successfully, while the sports domain fails most.
 - Anthology ID:
 - 2022.lrec-1.238
 - Volume:
 - Proceedings of the Thirteenth Language Resources and Evaluation Conference
 - Month:
 - June
 - Year:
 - 2022
 - Address:
 - Marseille, France
 - Editors:
 - Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
 - Venue:
 - LREC
 - Publisher:
 - European Language Resources Association
 - Pages:
 - 2215–2225
 - URL:
 - https://aclanthology.org/2022.lrec-1.238
 - Cite (ACL):
 - Cagri Toraman, Furkan Şahinuç, and Eyup Yilmaz. 2022. Large-Scale Hate Speech Detection with Cross-Domain Transfer. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2215–2225, Marseille, France. European Language Resources Association.
 - Cite (Informal):
 - Large-Scale Hate Speech Detection with Cross-Domain Transfer (Toraman et al., LREC 2022)
 - PDF:
 - https://preview.aclanthology.org/ingest-acl-2023-videos/2022.lrec-1.238.pdf
 - Code:
 - avaapm/hatespeech
 - Data:
 - HateXplain