Sarthak Roy
2025
HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation
Naquee Rizwan | Seid Muhie Yimam | Daryna Dementieva | Florian Skupin | Tim Fischer | Daniil Moskovskiy | Aarushi Ajay Borkar | Robert Geislinger | Punyajoy Saha | Sarthak Roy | Martin Semmann | Alexander Panchenko | Chris Biemann | Animesh Mukherjee
Findings of the Association for Computational Linguistics: ACL 2025
Despite regulations imposed by nations and social media platforms, e.g., (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measures like detoxification and counterspeech. In our work, which we call HATEPRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, as well as a lack of alignment with research efforts. Based on these insights, we suggest ideas and research directions for further exploration of a unified framework for automated hate speech moderation that incorporates diverse strategies.
2023
Probing LLMs for hate speech detection: strengths and vulnerabilities
Sarthak Roy | Ashish Harshvardhan | Animesh Mukherjee | Punyajoy Saha
Findings of the Association for Computational Linguistics: EMNLP 2023
Recently, efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. However, none of these works aims to use explanations, additional context, or victim community information in the detection process. We utilise different prompt variations and input information, and evaluate large language models in a zero-shot setting (without adding any in-context examples). We select two large language models (GPT-3.5 and text-davinci) and three datasets: HateXplain, Implicit Hate, and ToxicSpans. We find that, on average, including the target information in the pipeline substantially improves model performance (∼20-30%) over the baseline across the datasets. Adding the rationales/explanations into the pipeline also has a considerable effect (∼10-20% over the baseline) across the datasets. In addition, we provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reasons for the decisions they take. Such vulnerable points automatically constitute ‘jailbreak’ prompts for these models, and industry-scale safeguard techniques need to be developed to make the models robust against such prompts.
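To make the prompting setup concrete, below is a minimal sketch of how such zero-shot prompt variants (vanilla, with target-community information, and with annotated rationales) could be issued to a GPT-3.5-class model via the OpenAI chat completions API. The template wording, field names, and model identifier are illustrative assumptions, not the exact prompts or models used in the paper.

```python
# Minimal sketch of zero-shot hate speech classification with prompt variants that
# optionally add target-community information and human rationales, in the spirit of
# the pipeline described above. Prompt templates and field names are illustrative.
# Assumes the `openai` Python package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(post: str, target: str | None = None, rationale: str | None = None) -> str:
    """Compose a zero-shot prompt; optionally inject target and rationale context."""
    parts = [
        "Classify the following post as 'hateful' or 'non-hateful'.",
        f"Post: {post}",
    ]
    if target:
        parts.append(f"The post targets the following community: {target}.")
    if rationale:
        parts.append(f"Human-annotated rationale (key spans): {rationale}")
    parts.append("Answer with a single word: hateful or non-hateful.")
    return "\n".join(parts)


def classify(post: str, target: str | None = None, rationale: str | None = None) -> str:
    """Zero-shot call (no in-context examples), as in the evaluation setting above."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stands in for the GPT-3.5 variant evaluated in the paper
        messages=[{"role": "user", "content": build_prompt(post, target, rationale)}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content.strip().lower()


if __name__ == "__main__":
    example = "<post text from HateXplain / Implicit Hate / ToxicSpans>"
    print(classify(example))                                    # vanilla prompt
    print(classify(example, target="women"))                    # + target community
    print(classify(example, target="women", rationale="..."))   # + rationale spans
```

Comparing the three calls over a labelled dataset would reproduce the kind of ablation the abstract reports, i.e. measuring how much the target and rationale variants improve over the vanilla zero-shot baseline.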