Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales

Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu


Abstract
Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity afforded by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address this lack of interpretability, in this paper we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, and to use these rationales to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs with the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of social media hate speech datasets demonstrates: (1) the quality of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability. All code and data will be made available at https://github.com/AmritaBh/shield.
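The two-stage pipeline the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the LLM rationale extractor is stubbed with a keyword heuristic (a real system would prompt an LLM for rationale spans), and the classifier, lexicon, and all function names are hypothetical.

```python
# Hypothetical sketch of the pipeline from the abstract: (1) an "LLM" extracts
# rationale tokens from the input text, (2) those rationales train a base
# classifier, so every prediction comes paired with the rationale behind it.
from collections import Counter

HATE_CUES = {"hate", "stupid", "idiots"}  # stand-in lexicon, not from the paper


def extract_rationale(text: str) -> list[str]:
    """Stub for LLM rationale extraction: return tokens that justify a label."""
    return [tok for tok in text.lower().split() if tok.strip(".,!?") in HATE_CUES]


def train_classifier(examples: list[tuple[str, int]]):
    """Train a trivial rationale-based classifier: count cue tokens per class."""
    counts = {0: Counter(), 1: Counter()}  # 0 = benign, 1 = hate speech
    for text, label in examples:
        counts[label].update(extract_rationale(text))

    def predict(text: str) -> tuple[int, list[str]]:
        rationale = extract_rationale(text)
        # Score each class by overlap with its training-time rationale vocabulary.
        scores = {c: sum(counts[c][t] for t in rationale) for c in counts}
        label = max(scores, key=scores.get) if any(scores.values()) else 0
        return label, rationale  # prediction plus its supporting rationale

    return predict


predict = train_classifier([("you idiots", 1), ("nice day today", 0)])
label, rationale = predict("such stupid idiots")  # label 1, rationale tokens shown
```

Because the classifier consumes only the extracted rationales, each prediction is interpretable by construction: the returned rationale is exactly the evidence the model used.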
Anthology ID: 2024.woah-1.17
Volume: Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Yi-Ling Chung, Zeerak Talat, Debora Nozza, Flor Miriam Plaza-del-Arco, Paul Röttger, Aida Mostafazadeh Davani, Agostina Calabrese
Venues: WOAH | WS
Publisher: Association for Computational Linguistics
Pages: 223–233
URL: https://aclanthology.org/2024.woah-1.17
Cite (ACL): Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, and Huan Liu. 2024. Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024), pages 223–233, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales (Nirmal et al., WOAH-WS 2024)
PDF: https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.woah-1.17.pdf