Jackson Trager
2026
Self-Explaining Hate Speech Detection with Moral Rationales
Francielle Vargas | Jackson Trager | Diego Alves | Matteo Guida | Surendrabikram Thapa | Berk Atıl | Daryna Dementieva | Andrew J Smart | Ameeta Agrawal
Findings of the Association for Computational Linguistics: ACL 2026
Existing hate speech detection models are often opaque and rely on surface-level lexical cues, which makes them vulnerable to spurious correlations and limits robustness, interpretability and cultural contextualization. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance while enhancing both faithful and plausible explanations. Although explanations become more concise, sufficiency decreases, indicating more compact and informative rationales. Fairness remains stable, suggesting that improvements in explanation quality do not introduce significant bias trade-offs.
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian | Elnaz Rahmati | Jackson Trager | Zhivar Sourati | Nona Ghazizadeh | Georgios Chochlakis | Jose J. Alcocer | Kerby Bennett | Aarya Vijay Devnani | Parsa Hejabi | Harry G. Muttram | Akshay Kiran Padte | Mehrshad Saadatinia | Chenhao Wu | Alireza Salkhordeh Ziabari | Michael Sierra-Arévalo | Nicholas Weller | Shrikanth Narayanan | Benjamin A.T. Graham | Morteza Dehghani
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Traffic stops are among the most frequent police–civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, (i) we develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) we introduce a criterion-driven preference data construction framework for perspective-consistent alignment; and (iii) we propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.
2025
MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation
Jackson Trager | Francielle Vargas | Diego Alves | Matteo Guida | Mikel K. Ngueajio | Ameeta Agrawal | Yalda Daryani | Farzan Karimi Malekabadi | Flor Miriam Plaza-del-Arco
Findings of the Association for Computational Linguistics: EMNLP 2025
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanations using the Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited, mainly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
Co-authors
- Ameeta Agrawal 2
- Diego Alves 2
- Matteo Guida 2
- Francielle Vargas 2
- Jose J. Alcocer 1
- Berk Atıl 1
- Kerby Bennett 1
- Georgios Chochlakis 1
- Yalda Daryani 1
- Morteza Dehghani 1
- Daryna Dementieva 1
- Aarya Vijay Devnani 1
- Nona Ghazizadeh 1
- Preni Golazizian 1
- Benjamin A.T. Graham 1
- Parsa Hejabi 1
- Farzan Karimi Malekabadi 1
- Harry G. Muttram 1
- Shrikanth Narayanan 1
- Mikel K. Ngueajio 1
- Akshay Kiran Padte 1
- Flor Miriam Plaza-del-Arco 1
- Elnaz Rahmati 1
- Mehrshad Saadatinia 1
- Michael Sierra-Arévalo 1
- Andrew J Smart 1
- Zhivar Sourati 1
- Surendrabikram Thapa 1
- Nicholas Weller 1
- Chenhao Wu 1
- Alireza Salkhordeh Ziabari 1