Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems

Neil Fasching, Yphtach Lelkes


Abstract
Content moderation systems powered by large language models (LLMs) are increasingly deployed to detect hate speech; however, no systematic comparison exists between different systems. If different systems produce different outcomes for the same content, it undermines consistency and predictability, leading to moderation decisions that appear arbitrary or unfair. Analyzing seven leading models—dedicated Moderation Endpoints (OpenAI, Mistral), frontier LLMs (Claude 3.5 Sonnet, GPT-4o, Mistral Large, DeepSeek V3), and specialized content moderation APIs (Google Perspective API)—we demonstrate that moderation system choice fundamentally determines hate speech classification outcomes. Using a novel synthetic dataset of 1.3+ million sentences from a factorial design, we find identical content receives markedly different classification values across systems, with variations especially pronounced for specific demographic groups. Analysis across 125 distinct groups reveals these divergences reflect systematic differences in how models establish decision boundaries around harmful content, highlighting significant implications for automated content moderation.
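A minimal sketch of the kind of cross-system comparison the abstract describes, assuming access to two of the seven systems via their public client libraries. This is not the paper's code: the template strings, group names, and helper functions below are placeholder illustrations of a factorial design, and only the OpenAI moderation endpoint and Perspective API calls follow documented client usage.

```python
# Illustrative sketch only (not the paper's materials): templates, groups, and
# helper names are placeholders; API calls follow the public client libraries.
from itertools import product

from openai import OpenAI
from googleapiclient import discovery

# A factorial design crosses sentence templates with target groups, so every
# group appears in identical surrounding text and systems are compared on the
# exact same sentences.
TEMPLATES = [
    "I think all {group} are wonderful people.",
    "I think all {group} should be banned from this country.",
]
GROUPS = ["teachers", "immigrants", "nurses"]  # stand-ins for the 125 groups

sentences = [t.format(group=g) for t, g in product(TEMPLATES, GROUPS)]


def openai_hate_score(client: OpenAI, text: str) -> float:
    """Hate score (0-1) from OpenAI's moderation endpoint."""
    resp = client.moderations.create(model="omni-moderation-latest", input=text)
    return resp.results[0].category_scores.hate


def perspective_toxicity_score(service, text: str) -> float:
    """TOXICITY summary score (0-1) from Google's Perspective API."""
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    resp = service.comments().analyze(body=body).execute()
    return resp["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


if __name__ == "__main__":
    openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    perspective = discovery.build(
        "commentanalyzer",
        "v1alpha1",
        developerKey="YOUR_PERSPECTIVE_API_KEY",  # placeholder key
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )
    # Scoring identical sentences with multiple systems and comparing the values
    # is what exposes the cross-model inconsistencies the paper reports.
    for s in sentences:
        print(s, openai_hate_score(openai_client, s),
              perspective_toxicity_score(perspective, s))
```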
Anthology ID:
2025.findings-acl.1144
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
22271–22285
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1144/
Cite (ACL):
Neil Fasching and Yphtach Lelkes. 2025. Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22271–22285, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems (Fasching & Lelkes, Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1144.pdf