HATECAT-TR: A Hate Speech Span Detection and Categorization Dataset for Turkish

Hasan Kerem Şeker, Gökçe Uludoğan, Pelin Önal, Arzucan Özgür


Abstract
Hate speech on social media in Turkey remains a critical issue, frequently targeting minority groups. Effective moderation requires not only detecting hateful posts but also identifying the specific hateful expressions within them. To address this, we introduce HATECAT-TR, a span-annotated dataset of Turkish tweets containing 4465 hateful spans across 2981 posts, each directed at one of eight minority groups. Annotations were created using a semi-automated approach that combines GPT-4o-generated spans with human expert review to ensure accuracy. Each hateful span is categorized into one of five discourse types, enabling a fine-grained analysis of the nature and intent behind hateful content. We frame span detection as binary and multi-class token classification tasks and use state-of-the-art language models to establish baseline performance on the new dataset. Our findings highlight the challenges of detecting and categorizing implicit hate speech, particularly when spans are subtle and highly contextual. The source code is available at github.com/boun-tabi/hatecat-tr, and HATECAT-TR can be shared, in compliance with the terms of X, upon contacting the authors.
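As a rough illustration of what framing span detection as token classification can look like, the sketch below maps character-level span annotations onto per-token labels, once in a binary scheme and once in a multi-class scheme. It is not the authors' pipeline: the whitespace tokenizer, the function names, the `HATE`/`O` label names, and the `CATEGORY_1` placeholder (standing in for one of the five discourse types, which the abstract does not name) are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical annotation format: (start_char, end_char, category), end exclusive.
Span = Tuple[int, int, str]


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Split on whitespace and keep each token's character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def token_labels(text: str, spans: List[Span], multiclass: bool = False) -> List[str]:
    """Label each token with the first annotated span it overlaps, else "O"."""
    labels = []
    for _, tok_start, tok_end in tokenize_with_offsets(text):
        label = "O"
        for span_start, span_end, category in spans:
            # Two half-open intervals overlap when each starts before the other ends.
            if tok_start < span_end and span_start < tok_end:
                label = category if multiclass else "HATE"
                break
        labels.append(label)
    return labels


# Neutral placeholder text; "CATEGORY_1" is an assumed stand-in label.
text = "some neutral words then a flagged phrase here"
start = text.index("flagged phrase")
spans = [(start, start + len("flagged phrase"), "CATEGORY_1")]

binary = token_labels(text, spans)                  # binary token classification
multi = token_labels(text, spans, multiclass=True)  # multi-class token classification
```

In practice a subword tokenizer would replace the whitespace split, but the overlap test between token offsets and span offsets stays the same.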
Anthology ID:
2025.findings-emnlp.1393
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
25568–25579
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1393/
DOI:
10.18653/v1/2025.findings-emnlp.1393
Cite (ACL):
Hasan Kerem Şeker, Gökçe Uludoğan, Pelin Önal, and Arzucan Özgür. 2025. HATECAT-TR: A Hate Speech Span Detection and Categorization Dataset for Turkish. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25568–25579, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
HATECAT-TR: A Hate Speech Span Detection and Categorization Dataset for Turkish (Şeker et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1393.pdf
Checklist:
 2025.findings-emnlp.1393.checklist.pdf