Self-Explaining Hate Speech Detection with Moral Rationales

Francielle Vargas; Jackson Trager; Diego Alves; Matteo Guida; Surendrabikram Thapa; Berk Atıl; Daryna Dementieva; Andrew J Smart; Ameeta Agrawal

Abstract

Existing hate speech detection models are often opaque and rely on surface-level lexical cues, which makes them vulnerable to spurious correlations and limits their robustness, interpretability, and cultural contextualization. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance while improving both the faithfulness and the plausibility of explanations. Explanations also become more concise and sufficiency decreases, indicating more compact and informative rationales. Fairness remains stable, suggesting that improvements in explanation quality do not introduce significant bias trade-offs.
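
The abstract does not give the exact form of the SMRA training objective, so the following is only a minimal sketch of the general idea of supervising attention with token-level rationales. It assumes a binary rationale mask per token, a cross-entropy alignment term between the normalized mask and the attention distribution, and a weighting factor `lam`. All function names, the loss form, and `lam` are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(scores):
    """Turn raw attention scores into a probability distribution over tokens."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_alignment_loss(attention, rationale_mask, eps=1e-12):
    """Cross-entropy between the normalized rationale mask and the attention.

    Assumed loss form: the paper's actual alignment term may differ.
    """
    n_marked = sum(rationale_mask)
    if n_marked == 0:
        return 0.0  # no annotated rationale, so no attention supervision
    target = [r / n_marked for r in rationale_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))

def classification_loss(prob_hate, label, eps=1e-12):
    """Binary cross-entropy for the hate / non-hate decision."""
    return -(label * math.log(prob_hate + eps)
             + (1 - label) * math.log(1 - prob_hate + eps))

def joint_loss(prob_hate, label, attention, rationale_mask, lam=1.0):
    """Task loss plus a weighted attention-alignment term (lam is hypothetical)."""
    return (classification_loss(prob_hate, label)
            + lam * attention_alignment_loss(attention, rationale_mask))

# Toy example: four tokens, the two middle tokens are marked as the rationale.
mask = [0, 1, 1, 0]
aligned = softmax([0.0, 5.0, 5.0, 0.0])   # attention focused on the rationale
uniform = softmax([0.0, 0.0, 0.0, 0.0])   # attention spread evenly
print(joint_loss(0.9, 1, aligned, mask) < joint_loss(0.9, 1, uniform, mask))
```

Under these assumptions, attention that concentrates on the annotated span yields a lower joint loss than uniform attention for the same prediction, which is the behavior an attention-alignment objective is meant to reward.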

Anthology ID: 2026.findings-acl.1704
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34109–34131
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1704/
Cite (ACL): Francielle Vargas, Jackson Trager, Diego Alves, Matteo Guida, Surendrabikram Thapa, Berk Atıl, Daryna Dementieva, Andrew J Smart, and Ameeta Agrawal. 2026. Self-Explaining Hate Speech Detection with Moral Rationales. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34109–34131, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Self-Explaining Hate Speech Detection with Moral Rationales (Vargas et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1704.pdf
Checklist: 2026.findings-acl.1704.checklist.pdf
