Hate Explained: Evaluating NER-Enriched Text in Human and Machine Moderation of Hate Speech

Andres Carvallo, Marcelo Mendoza, Miguel Fernandez, Maximiliano Ojeda, Lilly Guevara, Diego Varela, Martin Borquez, Nicolas Buzeta, Felipe Ayala


Abstract
Hate speech detection is vital for creating safe online environments, as harmful content can drive social polarization. This study explores how enriching text with intent and group tags affects machine performance and human moderation workflows. For machine performance, we trained hate speech classifiers on the enriched text. Intent tags were the most effective, achieving state-of-the-art F1-score improvements on the IHC, SBIC, and DH datasets. Cross-dataset evaluations further demonstrated the superior generalization of intent-tagged models compared to other pre-trained approaches. Then, through a user study (N=100), we evaluated seven moderation settings, including intent tags, group tags, model probabilities, and randomized counterparts. Intent annotations significantly improved moderator accuracy, allowing moderators to outperform machine classifiers by 12.9%. Moderators also rated intent tags as the most useful explanation tool, with a 41% increase in perceived helpfulness over the control group. Our findings demonstrate that intent-based annotations enhance both machine classification performance and human moderation workflows.
Anthology ID:
2025.woah-1.42
Volume:
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del-Arco, Zeerak Talat, Francielle Vargas
Venues:
WOAH | WS
Publisher:
Association for Computational Linguistics
Pages:
458–467
URL:
https://preview.aclanthology.org/landing_page/2025.woah-1.42/
Cite (ACL):
Andres Carvallo, Marcelo Mendoza, Miguel Fernandez, Maximiliano Ojeda, Lilly Guevara, Diego Varela, Martin Borquez, Nicolas Buzeta, and Felipe Ayala. 2025. Hate Explained: Evaluating NER-Enriched Text in Human and Machine Moderation of Hate Speech. In Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH), pages 458–467, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Hate Explained: Evaluating NER-Enriched Text in Human and Machine Moderation of Hate Speech (Carvallo et al., WOAH 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.woah-1.42.pdf