Berk Atıl

2026

Self-Explaining Hate Speech Detection with Moral Rationales
Francielle Vargas | Jackson Trager | Diego Alves | Matteo Guida | Surendrabikram Thapa | Berk Atıl | Daryna Dementieva | Andrew J Smart | Ameeta Agrawal
Findings of the Association for Computational Linguistics: ACL 2026

Existing hate speech detection models are often opaque and rely on surface-level lexical cues, which makes them vulnerable to spurious correlations and limits robustness, interpretability and cultural contextualization. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance while enhancing both faithful and plausible explanations. Although explanations become more concise, sufficiency decreases, indicating more compact and informative rationales. Fairness remains stable, suggesting that improvements in explanation quality do not introduce significant bias trade-offs.


♪ Something Just Like TRuST ♪: Toxicity Recognition of Span and Target
Berk Atıl | Namrata Sureddy | Rebecca J. Passonneau
Findings of the Association for Computational Linguistics: ACL 2026

Toxic language includes content that is offensive, abusive, or that promotes harm. Progress in preventing toxic output from large language models (LLMs) is hampered by inconsistent definitions of toxicity. We introduce TRuST, a large-scale dataset that unifies and expands prior resources through a carefully synthesized definition of toxicity and a corresponding annotation scheme. It consists of ∼300k annotations, with high-quality human annotation on ∼11k. To ensure high quality, we designed a rigorous, multi-stage human annotation process and evaluated the diversity of the annotators. We then benchmarked state-of-the-art LLMs and pre-trained language models (PLMs) on three tasks: toxicity detection, identification of the target group, and identification of toxic words. Our results indicate that fine-tuned PLMs outperform LLMs on the three tasks, and that current reasoning models do not reliably improve performance. TRuST constitutes one of the most comprehensive resources for evaluating and mitigating LLM toxicity, and for other research in socially-aware and safer language technologies.

Co-authors

Andrew J Smart 1

Namrata Sureddy 1

Surendrabikram Thapa 1

Jackson Trager 1

Francielle Vargas 1

Venues

Findings 2
