HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content
Lorenzo Puppi Vecchi, Alceu De Souza Britto Jr., Emerson Cabrera Paraiso, Rafael M. O. Cruz
Abstract
Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. We argue that this optimization penalizes explanations in which references to stereotypes or offensive content are essential for explanatory fidelity. To address this gap, we introduce SBIC-Explain, a human-validated dataset of 370,788 LLM-generated NLEs for offensive content, spanning three levels of human-annotated contextual richness: Tier 1: text-only, Tier 2: + classification-aware, and Tier 3: + semantics-informed. We hypothesize that as human-annotated context increases, explanations should exhibit higher explanatory fidelity. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity in this context. We propose HARM (Hate-Aware Reward Model), an RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM outperforms general-purpose baselines, improving NLE pairwise preference. Available at: https://github.com/Lorenzo815/HARM.
- Anthology ID:
- 2026.findings-eacl.230
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 4393–4431
- Language:
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.230/
- DOI:
- Cite (ACL):
- Lorenzo Puppi Vecchi, Alceu De Souza Britto Jr., Emerson Cabrera Paraiso, and Rafael M. O. Cruz. 2026. HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4393–4431, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content (Vecchi et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.230.pdf