Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)

Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del-Arco, Zeerak Talat, Francielle Vargas (Editors)


Anthology ID:
2025.woah-1
Month:
August
Year:
2025
Address:
Vienna, Austria
Venues:
WOAH | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.woah-1/
DOI:
ISBN:
979-8-89176-105-6
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.woah-1.pdf

Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
Agostina Calabrese | Christine de Kock | Debora Nozza | Flor Miriam Plaza-del-Arco | Zeerak Talat | Francielle Vargas

A Comprehensive Taxonomy of Bias Mitigation Methods for Hate Speech Detection
Jan Fillies | Marius Wawerek | Adrian Paschke

Algorithmic hate speech detection is widely used today. However, biases within these systems can lead to discrimination. This research presents an overview of bias mitigation strategies in the field of hate speech detection. The identified principles are grouped into four categories, based on their operation principles. A novel taxonomy of bias mitigation methods is proposed. The mitigation strategies are characterized based on their key concepts and analyzed in terms of their application stage and their need for knowledge of protected attributes. Additionally, the paper discusses potential combinations of these strategies. This research shifts the focus from identifying present biases to examining the similarities and differences between mitigation strategies, thereby facilitating the exchange, stacking, and ensembling of these strategies in future research.

Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation
Dimosthenis Antypas | Indira Sen | Carla Perez Almendros | Jose Camacho-Collados | Francesco Barbieri

The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limitations in customisation, accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in detecting other sensitive categories such as substance abuse or self-harm. In this paper, we put forward a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous focalised research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which underperform by 10-15% overall. The gap is even more pronounced for popular moderation APIs, which cannot easily be tailored to specific sensitive content categories.

From civility to parity: Marxist-feminist ethics for context-aware algorithmic content moderation
Dayei Oh

Algorithmic content moderation governs online speech on large-scale commercial platforms, often under the guise of neutrality. Yet, it routinely reproduces white, middle-class norms of civility and penalizes marginalized voices for unruly and resistant speech. This paper critiques the prevailing ‘pathological’ approach to moderation that prioritizes sanitization over justice. Drawing on Marxist-feminist ethics, this paper advances three theses for the future of context-aware algorithmic moderation: (1) prioritizing participatory parity over civility, (2) incorporating identity- and context-aware analysis of speech; and (3) replacing purely numerical evaluations with justice-oriented, community-sensitive metrics. While acknowledging the structural limitations posed by platform capitalism, this paper positions the proposed framework as both critique and provocation, guiding regulatory reform, civil advocacy, and visions for mission-driven online content moderation serving digital commons.

A Novel Dataset for Classifying German Hate Speech Comments with Criminal Relevance
Vincent Kums | Florian Meyer | Luisa Pivit | Uliana Vedenina | Jonas Wortmann | Melanie Siegel | Dirk Labudde

The consistently high prevalence of hate speech on the Internet continues to pose significant social and individual challenges. Given the centrality of social networks in public discourse, automating the identification of criminally relevant content is a pressing challenge. This study addresses the challenge of developing an automated system that is capable of classifying online comments in a criminal justice context and categorising them into relevant sections of the criminal code. Not only technical, but also ethical and legal requirements must be considered. To this end, 351 comments were annotated by public prosecutors from the Central Office for Combating Internet and Computer Crime (ZIT) according to previously formed paragraph classes. These groupings consist of several German criminal law statutes that most hate comments violate. In the subsequent phase of the research, a further 839 records were assigned to the classes by student annotators who had been trained previously.

Learning from Disagreement: Entropy-Guided Few-Shot Selection for Toxic Language Detection
Tommaso Caselli | Flor Miriam Plaza-del-Arco

In-context learning (ICL) has shown significant benefits, particularly in scenarios where large amounts of labeled data are unavailable. However, its effectiveness for highly subjective tasks, such as toxic language detection, remains an open question. A key challenge in this setting is to select shots to maximize performance. Although previous work has focused on enhancing variety and representativeness, the role of annotator disagreement in shot selection has received less attention. In this paper, we conduct an in-depth analysis of ICL using two families of open-source LLMs (Llama-3* and Qwen2.5) of varying sizes, evaluating their performance in five prominent English datasets covering multiple toxic language phenomena. We use disaggregated annotations and categorize different types of training examples to assess their impact on model predictions. We specifically investigate whether selecting shots based on annotators’ entropy – focusing on ambiguous or difficult examples – can improve generalization in LLMs. Additionally, we examine the extent to which the order of examples in prompts influences model behavior. Our results show that selecting shots based on entropy from annotator disagreement can enhance ICL performance. Specifically, ambiguous shots with a median entropy value generally lead to the best results for our selected LLMs in the few-shot setting. However, ICL often underperforms when compared to fine-tuned encoders.
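As an illustration of the shot-selection idea described above, the following minimal Python sketch scores candidate examples by the Shannon entropy of their disaggregated annotations and keeps those closest to the pool's median entropy. It is not the authors' implementation; the data fields (`annotations`, `text`) and the tie-breaking choices are assumptions.

```python
import math

def label_entropy(labels):
    """Shannon entropy of a list of per-annotator labels, e.g. ["toxic", "not_toxic", "toxic"]."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def median_entropy_shots(pool, k=8):
    """Pick the k candidates whose annotator-disagreement entropy is closest to the
    pool's median entropy (one possible reading of 'ambiguous shots with a median
    entropy value')."""
    scored = [(label_entropy(ex["annotations"]), ex) for ex in pool]
    entropies = sorted(e for e, _ in scored)
    median = entropies[len(entropies) // 2]
    scored.sort(key=lambda pair: abs(pair[0] - median))
    return [ex for _, ex in scored[:k]]

# Illustrative usage with made-up data:
pool = [
    {"text": "example 1", "annotations": ["toxic", "toxic", "toxic"]},
    {"text": "example 2", "annotations": ["toxic", "not_toxic", "toxic"]},
    {"text": "example 3", "annotations": ["not_toxic", "toxic", "not_toxic"]},
]
shots = median_entropy_shots(pool, k=2)
```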

Debiasing Static Embeddings for Hate Speech Detection
Ling Sun | Soyoung Kim | Xiao Dong | Sandra Kübler

We examine how embedding bias affects hate speech detection by evaluating two debiasing methods—hard-debiasing and soft-debiasing. We analyze stereotype and sentiment associations within the embedding space and assess whether debiased models reduce censorship of marginalized authors while improving detection of hate speech targeting these groups. Our findings highlight how embedding bias propagates into downstream tasks and demonstrates how well different embedding bias metrics can predict bias in hate speech detection.

Web(er) of Hate: A Survey on How Hate Speech Is Typed
Luna Wang | Andrew Caines | Alice Hutchings

The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber’s notion of ideal types, we argue for a reflexive approach in dataset creation, urging researchers to acknowledge their own value judgments during dataset construction, fostering transparency and methodological rigour.

Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate Speech.
Mikel Ngueajio | Flor Miriam Plaza-del-Arco | Yi-Ling Chung | Danda Rawat | Amanda Cercas Curry

Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere’s CommandR-7B, and Meta’s LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.

HODIAT: A Dataset for Detecting Homotransphobic Hate Speech in Italian with Aggressiveness and Target Annotation
Greta Damo | Alessandra Teresa Cignarella | Tommaso Caselli | Viviana Patti | Debora Nozza

The escalating spread of homophobic and transphobic rhetoric in both online and offline spaces has become a growing global concern, with Italy standing out as one of the countries where acts of violence against LGBTQIA+ individuals persist and increase year after year. This short paper analyzes hateful language against LGBTQIA+ individuals in Italian using novel annotation labels for aggressiveness and target. We assess a range of multilingual and Italian language models on these new annotation layers across zero-shot, few-shot, and fine-tuning settings. The results reveal significant performance gaps across models and settings, highlighting the limitations of zero- and few-shot approaches and the importance of fine-tuning on labelled data, when available, to achieve high prediction performance.

Beyond the Binary: Analysing Transphobic Hate and Harassment Online
Anna Talas | Alice Hutchings

Online communities provide support and help to individuals transitioning gender. However, this point of transition also increases vulnerability, coupled with increased exposure to online harms. In this research, we analyse a popular hate and harassment site known for targeting minority groups, including transgender people. We analyse 17 million posts dating back to 2012 to gain insights into the types of information collected about targets. We find users commonly link to social media sites such as Twitter/X and meticulously archive links related to their targets. We scrape over 150,000 relevant links posted to Twitter/X and their archived versions and analyse the profiles and posts. We find targets often tweet about harassment, pop culture, and queer and gender-related discussions. We develop and evaluate classifiers to detect calls for harassment, doxxing, mention of transgender individuals, and toxic/abusive speech within the forum posts. The results of our classifiers show that forum posts about transgender individuals are significantly more likely to contain other harmful content.

Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
Sergey Berezin | Reza Farahbakhsh | Noel Crespi

We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models’ failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
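The attack idea above, rendering a toxic term as spatially structured ASCII art so that a text-only classifier no longer sees the original token sequence, can be sketched as below. The use of the `pyfiglet` library and the prompt wording are illustrative assumptions; the paper's own ToxASCII generation pipeline is not described in this abstract.

```python
# pip install pyfiglet  (rendering library chosen purely for illustration)
import pyfiglet

def to_ascii_art(word: str, font: str = "standard") -> str:
    """Render a word as multi-line ASCII art so that token-level models
    no longer see the original character string."""
    return pyfiglet.figlet_format(word, font=font)

payload = to_ascii_art("example")          # stand-in for a term a filter would flag
prompt = f"Please read the word below and use it in a sentence:\n{payload}"
print(prompt)  # a text-only moderation system mostly sees '#', '|' and '_' characters
```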

Debunking with Dialogue? Exploring AI-Generated Counterspeech to Challenge Conspiracy Theories
Mareike Lisker | Christina Gottschalk | Helena Mihaljević

Counterspeech is a key strategy against harmful online content, but scaling expert-driven efforts is challenging. Large Language Models (LLMs) present a potential solution, though their use in countering conspiracy theories is under-researched. Unlike for hate speech, no datasets exist that pair conspiracy theory comments with expert-crafted counterspeech. We address this gap by evaluating the ability of GPT-4o, Llama 3, and Mistral to effectively apply counterspeech strategies derived from psychological research provided through structured prompts. Our results show that the models often generate generic, repetitive, or superficial results. Additionally, they over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use in practical applications problematic.

MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages
Lu Kalkbrenner | Veronika Solopova | Steffen Zeiler | Robert Nickel | Dorothea Kolossa

Connectivity and message propagation are central, yet often underutilised, sources of information in misinformation detection—especially on poorly moderated platforms such as Telegram, which has become a critical channel for misinformation dissemination, notably in the German electoral context. In this paper, we introduce Misinfo-TeleGraph, the first German-language Telegram-based graph dataset for misinformation detection. It includes over 5 million messages from public channels, enriched with metadata, channel relationships, and both weak and strong labels. These labels are derived via semantic similarity to fact-checks and news articles using M3-embeddings, as well as manual annotation. To establish reproducible baselines, we evaluate both text-only models and graph neural networks (GNNs) that incorporate message forwarding as a network structure. Our results show that GraphSAGE with LSTM aggregation significantly outperforms text-only baselines in terms of Matthews Correlation Coefficient (MCC) and F1-score. We further evaluate the impact of subscribers, view counts, and automatically versus human-created labels on performance, and highlight both the potential and challenges of weak supervision in this domain. This work provides a reproducible benchmark and open dataset for future research on misinformation detection in German-language Telegram networks and other low-moderation social platforms.
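A minimal sketch of the weak-labelling step, assigning a message a weak misinformation label when its embedding lies close to that of a known fact-checked claim, is given below. It uses a generic sentence-transformers encoder as a stand-in for the M3-embeddings mentioned in the abstract, and the 0.8 threshold is an illustrative assumption.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Stand-in multilingual encoder; the paper uses M3-embeddings.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

messages = ["Telegram message text ...", "Another message ..."]
fact_checks = ["Debunked claim A ...", "Debunked claim B ..."]

msg_emb = model.encode(messages, convert_to_tensor=True, normalize_embeddings=True)
fc_emb = model.encode(fact_checks, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(msg_emb, fc_emb)          # shape: messages x fact-checks
THRESHOLD = 0.8                                     # illustrative cut-off
weak_labels = (similarity.max(dim=1).values > THRESHOLD).tolist()
print(weak_labels)   # True = weakly labelled as matching a debunked claim
```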

Catching Stray Balls: Football, fandom, and the impact on digital discourse
Mark Hill

This paper examines how emotional responses to football matches influence online discourse across digital spaces on Reddit. By analysing millions of posts from dozens of subreddits, it demonstrates that real-world events trigger sentiment shifts that move across communities. It shows that negative sentiment correlates with problematic language; match outcomes directly influence sentiment and posting habits; sentiment can transfer to unrelated communities; and offers insights into the content of this shifting discourse. These findings reveal how digital spaces function not as isolated environments, but as interconnected emotional ecosystems vulnerable to cross-domain contagion triggered by real-world events, contributing to our understanding of the propagation of online toxicity. While football is used as a case-study to computationally measure affective causes and movements, these patterns have implications for understanding online communities broadly.

Exploring Hate Speech Detection Models for Lithuanian Language
Justina Mandravickaitė | Eglė Rimkienė | Mindaugas Petkevičius | Milita Songailaitė | Eimantas Zaranka | Tomas Krilavičius

Online hate speech poses a significant challenge, as it can incite violence and contribute to social polarization. This study evaluates traditional machine learning, deep learning, and large language models (LLMs) for Lithuanian hate speech detection, addressing the class imbalance issue via data augmentation and resampling techniques. Our dataset included 27,358 user-generated comments, annotated as Neutral language (56%), Offensive language (29%), and Hate speech (15%). We trained BiLSTM, LSTM, CNN, SVM, and Random Forest models and fine-tuned Multilingual BERT, LitLat BERT, Electra, RWKV, ChatGPT, LT-Llama-2, and Gemma-2 models. Additionally, we pre-trained Electra for Lithuanian. Models were evaluated using accuracy and weighted F1-score. On the imbalanced dataset, LitLat BERT (0.76 weighted F1-score) and Multilingual BERT (0.73 weighted F1-score) performed best. Over-sampling further boosted weighted F1-scores, with Multilingual BERT (0.85) and LitLat BERT (0.84) outperforming other models. Over-sampling combined with augmentation provided the best overall results. Under-sampling led to performance declines and was less effective. Finally, fine-tuning LLMs improved their accuracy, which highlights the importance of fine-tuning for more specialized NLP tasks.
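A minimal sketch of the random over-sampling step used to counter class imbalance is shown below; the three-way label scheme mirrors the abstract, but the field names and seed are illustrative assumptions, not the authors' code.

```python
import random
from collections import Counter

def oversample(examples, label_key="label", seed=13):
    """Randomly duplicate minority-class examples until every class
    matches the majority-class count."""
    random.seed(seed)
    by_class = {}
    for ex in examples:
        by_class.setdefault(ex[label_key], []).append(ex)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend(items)
        balanced.extend(random.choices(items, k=target - len(items)))
    random.shuffle(balanced)
    return balanced

data = [{"text": "t1", "label": "neutral"}] * 56 + \
       [{"text": "t2", "label": "offensive"}] * 29 + \
       [{"text": "t3", "label": "hate"}] * 15
print(Counter(ex["label"] for ex in oversample(data)))   # all classes now equal in size
```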

RAG and Recall: Multilingual Hate Speech Detection with Semantic Memory
Khouloud Mnassri | Reza Farahbakhsh | Noel Crespi

Multilingual hate speech detection presents a challenging task, particularly in limited-resource contexts where performance is affected by cultural nuances and data scarcity. Fine-tuned models are often unable to generalize beyond their training, which limits their efficiency, especially for low-resource languages. In this paper, we introduce HS-RAG, a retrieval-augmented generation (RAG) system that supplies Large Language Models (LLMs) with knowledge in English, French, and Arabic drawn from the publicly available Hate Speech Superset and from Wikipedia. To further enhance robustness, we introduce HS-MemRAG, a memory-augmented extension that integrates a semantic cache. This model reduces redundant retrieval while improving contextual relevance and hate speech detection across the three languages.
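A toy sketch of the semantic-cache idea behind a memory-augmented RAG pipeline follows: a new query reuses previously retrieved context when its embedding is sufficiently similar to an earlier query, and only falls back to retrieval on a cache miss. The class and function names, the cosine-similarity threshold, and the prompt template are assumptions for illustration, not the HS-MemRAG implementation.

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: reuse retrieved context for queries whose
    embeddings are close to a previously seen query."""
    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn          # callable: str -> 1-D numpy array
        self.threshold = threshold
        self.keys, self.values = [], []

    def lookup(self, query):
        q = self.embed_fn(query)
        for k, v in zip(self.keys, self.values):
            sim = float(np.dot(q, k) / (np.linalg.norm(q) * np.linalg.norm(k)))
            if sim >= self.threshold:
                return v                  # cache hit: skip retrieval
        return None

    def store(self, query, context):
        self.keys.append(self.embed_fn(query))
        self.values.append(context)

def classify(query, cache, retrieve_fn, llm_fn):
    """retrieve_fn and llm_fn are hypothetical callables standing in for the
    retriever over the knowledge sources and the downstream LLM."""
    context = cache.lookup(query)
    if context is None:
        context = retrieve_fn(query)      # e.g. search the hate speech knowledge base
        cache.store(query, context)
    prompt = f"Context:\n{context}\n\nIs the following message hate speech?\n{query}"
    return llm_fn(prompt)
```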

Implicit Hate Target Span Detection in Zero- and Few-Shot Settings with Selective Sub-Billion Parameter Models
Hossam Boudraa | Benoit Favre | Raquel Urena

This work investigates the effectiveness of masked language models (MLMs) and autoregressive language models (LLMs) with fewer than one billion parameters in the detection of implicit hate speech through fine-grained span identification. The evaluation spans zero-shot, few-shot, and full supervision settings across two core benchmarks—SBIC and IHC—and an auxiliary testbed, OffensiveLang. RoBERTa-Large-355M emerges as the strongest zero-shot model, achieving the highest F1 scores of 75.8 (SBIC) and 72.5 (IHC), outperforming larger models like LLaMA 3.2-1B. ModernBERT-125M closely matches this performance with scores of 75.1 and 72.2, demonstrating the advantage of architectural efficiency. Among instruction-tuned models, SmolLM2-135M Instruct and LLaMA 3.2 1B Instruct consistently outperform their non-instructed counterparts, with up to +2.3 F1 gain on SBIC and +1.7 on IHC. Interestingly, the larger SmolLM2-360M Instruct does not outperform the 135M variant, highlighting that model scale does not always correlate with performance in implicit hate detection tasks. Few-shot fine-tuning with SmolLM2-135M Instruct achieves F1 scores of 68.2 (SBIC) and 64.0 (IHC), trailing full-data fine-tuning by only 1.6 and 2.0 points, respectively, with accuracy drops under 0.5 points. This illustrates the promise of compact, instruction-aligned models in data-scarce settings, particularly when optimized with Low-Rank Adaptation (LoRA). Topic-guided error analysis using Latent Dirichlet Allocation (LDA) reveals recurring model failures in ideologically charged or euphemistic discourse. Misclassifications often involve neutral references to identity, politics, or advocacy language, underscoring current limitations in discourse-level inference and sociopragmatic understanding.

Hate Speech in Times of Crises: a Cross-Disciplinary Analysis of Online Xenophobia in Greece
Maria Pontiki | Vasiliki Georgiadou | Lamprini Rori | Maria Gavriilidou

Bridging NLP with political science, this paper examines both the potential and the limitations of a computational hate speech detection method in addressing real-world questions. Using Greece as a case study, we analyze over 4 million tweets from 2015 to 2022—a period marked by economic, refugee, foreign policy, and pandemic crises. The analysis of false positives highlights the challenges of accurately detecting different types of verbal attacks across various targets and timeframes. In addition, the analysis of true positives reveals distinct linguistic patterns that reinforce populist narratives, polarization and hostility. By situating these findings within their socio-political context, we provide insights into how hate speech manifests online in response to real-world crises.

Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting MPs
Mugdha Pandya | Mali Jin | Kalina Bontcheva | Diana Maynard

Social media platforms, particularly X, enable direct interaction between politicians and constituents but also expose politicians to hostile responses targeting both their governmental role and personal identity. This online hostility can undermine public trust and potentially incite offline violence. While general hostility detection models exist, they lack the specificity needed for political contexts and country-specific issues. We address this gap by creating a dataset of 3,320 English tweets directed at UK Members of Parliament (MPs) over two years, annotated for hostility and targeted identity characteristics (race, gender, religion). Through linguistic and topical analyses, we examine the unique features of UK political discourse and evaluate pre-trained language models and large language models on binary hostility detection and multi-class targeted identity type classification tasks. Our work provides essential data and insights for studying politics-related hostility in the UK.

Detoxify-IT: An Italian Parallel Dataset for Text Detoxification
Viola De Ruvo | Arianna Muti | Daryna Dementieva | Debora Nozza

Toxic language online poses growing challenges for content moderation. Detoxification, which rewrites toxic content into neutral form, offers a promising alternative but remains underexplored beyond English. We present Detoxify-IT, the first Italian dataset for this task, featuring toxic comments and their human-written neutral rewrites. Our experiments show that even limited fine-tuning on Italian data leads to notable improvements in content preservation and fluency compared to both multilingual models and LLMs used in zero-shot settings, underlining the need for language-specific resources. This work enables detoxification research in Italian and supports broader efforts toward safer, more inclusive online communication.

Pathways to Radicalisation: On Research for Online Radicalisation in Natural Language Processing and Machine Learning
Zeerak Talat | Michael Sejr Schlichtkrull | Pranava Madhyastha | Christine De Kock

Online communities play an integral part in communication across the globe, and some are known for extremist content. As fields of surveillance technology, NLP and other areas of ML hold particular promise for monitoring extremist communities that may turn violent. Such communities make use of a wide variety of modalities of communication, including textual posts on specialised fora, memes, videos, and podcasts. Furthermore, such communities undergo rapid linguistic evolution, presenting a challenge to machine learning technologies, which quickly fall out of step with the data they were trained on. In this position paper, we argue that radicalisation is a nascent area for which machine learning is particularly apt. However, radicalisation research must avoid the temptation of focusing on prediction. We argue that such communities present a particular avenue for addressing key concerns with machine learning technologies: (1) the temporal misalignment of models and (2) aligning and linking content across modalities.

Social Hatred: Efficient Multimodal Detection of Hatemongers
Tom Marzea | Abraham Israeli | Oren Tsur

Automatic detection of online hate speech serves as a crucial step in the detoxification of online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is just as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hatemongers, taking into account the potentially hateful texts, user activity, and the user network. Evaluating our method on three unique datasets, X (Twitter), Gab, and Parler, we show that processing a user’s texts in her social context significantly improves the detection of hatemongers, compared to previously used text- and graph-based methods. We offer a comprehensive set of results obtained in different experimental settings, as well as a qualitative analysis of illustrative cases. Our method can be used to improve the classification of coded messages, dog-whistling, and racial gaslighting, as well as to inform intervention measures. Moreover, we demonstrate that our multimodal approach performs well across very different content platforms and over large datasets and networks.

Blue-haired, misandriche, rabiata: Tracing the Connotation of ‘Feminist(s)’ Across Time, Languages and Domains
Arianna Muti | Sara Gemelli | Emanuele Moscato | Emilie Francis | Amanda Cercas Curry | Flor Miriam Plaza-del-Arco | Debora Nozza

Understanding how words shift in meaning is crucial for analyzing societal attitudes. In this study, we investigate the contextual variations of the terms feminist and feminists along three axes: time, language, and domain. To this aim, we collect and release FEMME, a dataset comprising occurrences of these terms from 2014 to 2023 in English, Italian, and Swedish across the Twitter, Reddit, and Incel domains. Our methodology leverages frame analysis as well as fine-tuning and LLMs. We find that the connotation of the plural form feminists is consistently more negative than that of feminist, indicating more hostility towards feminists as a collective, which often triggers greater societal pushback, reflecting broader patterns of group-based hostility and stigma. Across languages, we observe similar stereotypes towards feminists, often including body shaming as well as accusations of hypocrisy and irrational behavior. In terms of time, we identify events that trigger peaks in negative or positive connotation. As expected, the Incel spheres show predominantly negative connotations, while the general domains show mixed connotations.

Towards Fairness Assessment of Dutch Hate Speech Detection
Julie Bauer | Rishabh Kaushal | Thales Bertaglia | Adriana Iamnitchi

Numerous studies have proposed computational methods to detect hate speech online, yet most focus on the English language and emphasize model development. In this study, we evaluate the counterfactual fairness of hate speech detection models in the Dutch language, specifically examining the performance and fairness of transformer-based models. We make the following key contributions. First, we curate a list of Dutch Social Group Terms that reflect social context. Second, we generate counterfactual data for Dutch hate speech using LLMs and established strategies like Manual Group Substitution (MGS) and Sentence Log-Likelihood (SLL). Through qualitative evaluation, we highlight the challenges of generating realistic counterfactuals, particularly with Dutch grammar and contextual coherence. Third, we fine-tune baseline transformer-based models with counterfactual data and evaluate their performance in detecting hate speech. Fourth, we assess the fairness of these models using Counterfactual Token Fairness (CTF) and group fairness metrics, including equality of odds and demographic parity. Our analysis shows that the counterfactually fine-tuned models perform better in terms of hate speech detection, average counterfactual fairness, and group fairness. This work addresses a significant gap in the literature on counterfactual fairness for hate speech detection in Dutch and provides practical insights and recommendations for improving both model performance and fairness.
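A minimal sketch of the Manual Group Substitution (MGS) strategy is given below: every occurrence of a listed Dutch social-group term is swapped for each alternative term to produce counterfactual sentences. The term list here is a small illustrative example, not the curated list from the paper, and the grammatical post-editing discussed in the abstract is left out.

```python
import re

# Illustrative Dutch social-group terms; the paper curates its own list.
GROUP_TERMS = ["moslims", "christenen", "vrouwen", "mannen"]

def manual_group_substitution(sentence: str):
    """Generate counterfactuals by replacing any group term found in the
    sentence with every other term from the list (word-boundary match,
    case-insensitive)."""
    counterfactuals = []
    for term in GROUP_TERMS:
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        if pattern.search(sentence):
            for replacement in GROUP_TERMS:
                if replacement != term:
                    counterfactuals.append(pattern.sub(replacement, sentence))
    return counterfactuals

print(manual_group_substitution("Ik vertrouw moslims niet."))
```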

Between Hetero-Fatalism and Dark Femininity: Discussions of Relationships, Sex, and Men in the Femosphere
Emilie Francis

The ‘femosphere’ is a term coined to describe a group of online ideological spaces for women characterised by toxicity, reactionary feminism, and hetero-pessimism. It is often portrayed as a mirror of a similar group of communities for men, called the ‘manosphere’. Although there have been several studies investigating the ideologies and language of the manosphere, the femosphere has been largely overlooked - especially in NLP. This paper presents a study of two communities in the femosphere: Female Dating Strategy and Femcels. It presents an exploration of the language of these communities on topics related to relationships, sex, and men from the perspective of hetero-pessimism using topic modelling and semantic analysis. It reveals dissatisfaction with heterosexual courtship and frustration with the patriarchal society through which members attempt to navigate.

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Berk Atil | Vipul Gupta | Sarkar Snigdha Sarathi Das | Rebecca Passonneau

Large language models (LLMs) have become ubiquitous, thus it is important to understand their risks and limitations, such as their propensity to generate harmful output. This includes smaller LLMs, which are important for settings with constrained compute resources, such as edge devices. Detection of LLM harm typically requires human annotation, which is expensive to collect. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we compare harm annotation from three state-of-the-art large LLMs with each other and with humans. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans.

Are You Trying to Convince Me or Are You Trying to Deceive Me? Using Argumentation Types to Identify Deceptive News
Ricardo Muñoz Sánchez | Emilie Francis | Anna Lindahl

The way we relay factual information and the way we present deceptive information as truth differs from the perspective of argumentation. In this paper, we explore whether these differences can be exploited to detect deceptive political news in English. We do this by training a model to detect different kinds of argumentation in online news text. We use sentence embeddings extracted from an argumentation type classification model as features for a deceptive news classifier. This deceptive news classification model leverages the sequence of argumentation types within an article to determine whether it is credible or deceptive. Our approach outperforms other state-of-the-art models while having lower variance. Finally, we use the output of our argumentation model to analyze the differences between credible and deceptive news based on the distribution of argumentation types across the articles. Results of this analysis indicate that credible political news presents statements supported by a variety of argumentation types, while deceptive news relies on anecdotes and testimonial.

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
Taegyeong Lee | Jeonghwa Yoo | Hyoungseo Cho | Soo Yong Kim | Yunho Maeng

The recent advancements in Large Language Models (LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful prompts and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.

Who leads? Who follows? Temporal dynamics of political dogwhistles in Swedish online communities
Max Boholm | Gregor Rettenegger | Ellen Breitholtz | Robin Cooper | Elina Lindgren | Björn Rönnerstrand | Asad Sayeed

A dogwhistle is a communicative act intended to broadcast a message only understood by a select in-group while going unnoticed by others (out-group). We illustrate that political dogwhistle behavior in a more radical community precedes the occurrence of the dogwhistles in a less radical community, but the reverse does not hold. We study two Swedish online communities – Flashback and Familjeliv – which both contain discussions of life and society, with the former having a stronger anti-immigrant subtext. Expressions associated with dogwhistles are substantially more frequent in Flashback than in Familjeliv. We analyze the time series of changes in intensity of three dogwhistle expressions (DWEs), i.e., the strength of association between a DWE and its in-group meaning as modeled by Swedish Sentence-BERT, and model the dynamic temporal relationship of intensity in the two communities for the three DWEs using Vector Autoregression (VAR). We show that changes in intensity in Familjeliv are explained by the changes of intensity observed at previous lags in Flashback but not the other way around. This suggests a direction of travel for dogwhistles associated with radical ideologies to less radical contexts.
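The lead-lag analysis can be sketched with a standard VAR and Granger-style causality tests, as below. The code uses simulated intensity series and `statsmodels`; the lag order, data, and variable names are illustrative assumptions rather than the authors' setup.

```python
# pip install statsmodels pandas numpy
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
T = 200
flashback = rng.normal(size=T).cumsum()                              # simulated intensity series
familjeliv = np.roll(flashback, 3) + rng.normal(scale=0.5, size=T)   # roughly a lagged copy

df = pd.DataFrame({"flashback": flashback, "familjeliv": familjeliv})
results = VAR(df).fit(maxlags=8, ic="aic")

# Does past Flashback intensity help predict Familjeliv intensity, and not vice versa?
print(results.test_causality("familjeliv", ["flashback"], kind="f").summary())
print(results.test_causality("flashback", ["familjeliv"], kind="f").summary())
```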

Detecting Child Objectification on Social Media: Challenges in Language Modeling
Miriam Schirmer | Angelina Voggenreiter | Juergen Pfeffer | Agnes Horvat

Online objectification of children can harm their self-image and influence how others perceive them. Objectifying comments may start with a focus on appearance but also include language that treats children as passive, decorative, or lacking agency. On TikTok, algorithm-driven visibility amplifies this focus on looks. Drawing on objectification theory, we introduce a Child Objectification Language Typology to automatically classify objectifying comments. Our dataset consists of 562,508 comments from 9,090 videos across 482 TikTok accounts. We compare language models of different complexity, including an n-gram-based model, RoBERTa, GPT-4, LLaMA, and Mistral. On our training dataset of 6,000 manually labeled comments, we found that RoBERTa performed best overall in detecting appearance- and objectification-related language. 10.35% of comments contained appearance-related language, while 2.90% included objectifying language. Videos with school-aged girls received more appearance-related comments compared to boys in that age group, while videos with toddlers show a slight increase in objectification-related comments compared to other age groups. Neither gender alone nor engagement metrics showed significant effects. The findings raise concerns about children’s digital exposure, emphasizing the need for stricter policies to protect minors.

Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study
Faeze Ghorbanpour | Daryna Dementieva | Alexandar Fraser

Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.

Multilingual Analysis of Narrative Properties in Conspiracist vs Mainstream Telegram Channels
Katarina Laken | Matteo Melis | Sara Tonelli | Marcos Garcia

Conspiracist narratives posit an omnipotent, evil group causing harm across domains. However, modern-day online conspiracism is often more erratic, consisting of loosely connected posts displaying a general anti-establishment attitude pervaded by negative emotions. We gather a dataset of 300 conspiracist and mainstream Telegram channels in Italian and English and use automatic extraction of entities and emotion detection to compare structural characteristics of both types of channels. We create a co-occurrence network of entities to analyze how the different types of channels introduce and use them across posts and topics. We find that conspiracist channels are characterized by anger. Moreover, co-occurrence networks of entities appearing in conspiracist channels are more dense. We theorize that this reflects a narrative structure where all actants are pushed into a single domain. Conspiracist channels disproportionately associate the most central group of entities with anger and fear. We do not find evidence that entities in conspiracist narratives occur across more topics. This could indicate an erratic type of online conspiracism where everything can be connected to everything and that is characterized by a high number of entities and high levels of anger.
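A minimal sketch of the entity co-occurrence analysis follows: entities extracted from each post are linked pairwise, and graph density is compared between channel types. The toy entity sets are invented for illustration and do not come from the dataset.

```python
# pip install networkx
import itertools
import networkx as nx

def cooccurrence_graph(posts):
    """posts: list of sets of entities extracted from each message."""
    g = nx.Graph()
    for entities in posts:
        for a, b in itertools.combinations(sorted(entities), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g

# Invented example posts (entity sets) for two channel types.
conspiracist_posts = [{"WHO", "Gates", "vaccine"}, {"Gates", "5G"}, {"WHO", "5G", "Gates"}]
mainstream_posts = [{"WHO", "vaccine"}, {"parliament", "election"}, {"election", "inflation"}]

g_con = cooccurrence_graph(conspiracist_posts)
g_main = cooccurrence_graph(mainstream_posts)
print(nx.density(g_con), nx.density(g_main))   # denser network expected for conspiracist channels
```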

Hate Explained: Evaluating NER-Enriched Text in Human and Machine Moderation of Hate Speech
Andres Carvallo | Marcelo Mendoza | Miguel Fernandez | Maximiliano Ojeda | Lilly Guevara | Diego Varela | Martin Borquez | Nicolas Buzeta | Felipe Ayala

Hate speech detection is vital for creating safe online environments, as harmful content can drive social polarization. This study explores the impact of enriching text with intent and group tags on machine performance and human moderation workflows. For machine performance, we enriched text with intent and group tags to train hate speech classifiers. Intent tags were the most effective, achieving state-of-the-art F1-score improvements on the IHC, SBIC, and DH datasets. Cross-dataset evaluations further demonstrated the superior generalization of intent-tagged models compared to other pre-trained approaches. Then, through a user study (N=100), we evaluated seven moderation settings, including intent tags, group tags, model probabilities, and randomized counterparts. Intent annotations significantly improved the accuracy of the moderators, allowing them to outperform machine classifiers by 12.9%. Moderators also rated intent tags as the most useful explanation tool, with a 41% increase in perceived helpfulness over the control group. Our findings demonstrate that intent-based annotations enhance both machine classification performance and human moderation workflows.
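The enrichment step can be pictured as prepending intent and group tags to the raw comment before classification or display to moderators, as in the sketch below; the bracketed tag format and example values are assumptions, not the paper's exact scheme.

```python
def enrich(text: str, intent: str = None, group: str = None) -> str:
    """Prepend special tags encoding the annotated intent and target group,
    e.g. '[INTENT=exclusion] [GROUP=immigrants] original comment ...'."""
    tags = []
    if intent:
        tags.append(f"[INTENT={intent}]")
    if group:
        tags.append(f"[GROUP={group}]")
    return " ".join(tags + [text])

sample = enrich("They should all go back where they came from.",
                intent="exclusion", group="immigrants")
# The enriched string is then tokenised and fed to the hate speech classifier,
# and the same tags can be shown to human moderators as explanations.
print(sample)
```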

Personas with Attitudes: Controlling LLMs for Diverse Data Annotation
Leon Fröhling | Gianluca Demartini | Dennis Assenmacher

We present a novel approach for enhancing diversity and control in data annotation tasks by personalizing large language models (LLMs). We investigate the impact of injecting diverse persona descriptions into LLM prompts across two studies, exploring whether personas increase annotation diversity and whether the impacts of individual personas on the resulting annotations are consistent and controllable. Our results indicate that persona-prompted LLMs generate more diverse annotations than LLMs prompted without personas, and that the effects of personas on LLM annotations align with subjective differences in human annotations. These effects are both controllable and repeatable, making our approach a valuable tool for enhancing data annotation in subjective NLP tasks such as toxicity detection.
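A minimal sketch of persona-prompted annotation, assuming a generic chat-style prompt format: the same post is annotated under different persona descriptions, and the spread of labels can then be compared to human annotation diversity. The persona texts and instruction wording are illustrative, not the prompts used in the study.

```python
def build_persona_prompt(persona: str, text: str) -> list:
    """Compose a chat-style prompt asking the model to annotate toxicity
    while adopting a given persona description."""
    return [
        {"role": "system",
         "content": ("You are annotating social media posts. " + persona + " "
                     "Label the post as 'toxic' or 'not toxic' and answer with the label only.")},
        {"role": "user", "content": text},
    ]

personas = [
    "You are a 65-year-old retired teacher from a small rural town.",
    "You are a 22-year-old student active in online gaming communities.",
]
post = "That take is absolute garbage and so are you."
prompts = [build_persona_prompt(p, post) for p in personas]
# Each prompt is sent to the same LLM; the variation in labels across personas
# is then compared against the spread of human annotations.
```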

Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
Daniel Schwarz | Dmitriy Bespalov | Zhe Wang | Ninad Kulkarni | Yanjun Qi

As large language models (LLMs) become increasingly prevalent, ensuring their robustness against adversarial misuse is crucial. This paper introduces the GAP (Graph of Attacks with Pruning) framework, an advanced approach for generating stealthy jailbreak prompts to evaluate and enhance LLM safeguards. GAP addresses limitations in existing tree-based methods by implementing an interconnected graph structure that enables knowledge sharing across attack paths. Our experimental evaluation demonstrates GAP’s superiority over existing techniques, achieving a 20.8% increase in attack success rates while reducing query costs by 62.7%. GAP consistently outperforms state-of-the-art methods across various open and closed LLMs, with attack success rates of 96%. Additionally, we present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks. GAP-generated prompts prove highly effective in improving content moderation systems, increasing true positive detection rates by 108.5% and accuracy by 183.6% when used for fine-tuning.

A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance
Matteo Melis | Gabriella Lapesa | Dennis Assenmacher

Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 conceptual elements—building blocks that capture different aspects of hate speech definitions, such as references to the target of hate. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs, on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.

Red-Teaming for Uncovering Societal Bias in Large Language Models
Chu Fei Luo | Ahmad Ghawanmeh | Kashyap Coimbatore Murali | Bhimshetty Bharat Kumar | Murli Jadhav | Xiaodan Zhu | Faiza Khan Khattak

Ensuring the safe deployment of AI systems is critical in industry settings where biased outputs can lead to significant operational, reputational, and regulatory risks. Thorough evaluation before deployment is essential to prevent these hazards. Red-teaming addresses this need by employing adversarial attacks to develop guardrails that detect and reject biased or harmful queries, enabling models to be retrained or steered away from harmful outputs. However, red-teaming techniques are often limited, and malicious actors may discover new vulnerabilities that bypass safety fine-tuning, underscoring the need for ongoing research and innovative approaches. Notably, most red-teaming efforts focus on harmful or unethical instructions rather than addressing social bias, leaving this critical area under-explored despite its significant real-world impact, especially in customer-facing AI systems. We propose two bias-specific red-teaming methods, Emotional Bias Probe (EBP) and BiasKG, to evaluate how standard safety measures for harmful content mitigate bias. For BiasKG, we refactor natural language stereotypes into a knowledge graph and use adversarial attacking strategies to induce biased responses from several open- and closed-source language models. We find our method increases bias in all models, even those trained with safety guardrails. Our work emphasizes uncovering societal bias in LLMs through rigorous evaluation, addressing adversarial challenges to ensure AI safety in high-stakes industry deployments.

Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification
Sebastian Loftus | Adrian Mülthaler | Sanne Hoeken | Sina Zarrieß | Ozge Alacam

Annotator disagreement poses a significant challenge in subjective tasks like hate speech detection. In this paper, we introduce a novel variant of the HateWiC task that explicitly models annotator agreement by estimating the proportion of annotators who classify the meaning of a term as hateful. To tackle this challenge, we explore the use of Llama 3 models fine-tuned through Direct Preference Optimization (DPO). Our experiments show that while LLMs perform well for majority-based hate classification, they struggle with the more complex agreement-aware task. DPO fine-tuning offers improvements, particularly when applied to instruction-tuned models. Yet, our results emphasize the need for improved modeling of subjectivity in hate classification, and this study can serve as a foundation for future advancements.
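The agreement-aware target described above can be computed directly from disaggregated annotations as the proportion of annotators who judged a usage hateful, as in this small sketch (field names and label strings are assumptions for illustration).

```python
def hateful_proportion(annotations):
    """Fraction of annotators who judged the usage of the term hateful:
    1.0 = unanimously hateful, 0.0 = unanimously not hateful."""
    return sum(1 for a in annotations if a == "hateful") / len(annotations)

examples = [
    {"term": "snowflake", "context": "...", "annotations": ["hateful", "not_hateful", "hateful", "hateful"]},
    {"term": "snowflake", "context": "...", "annotations": ["not_hateful", "not_hateful", "not_hateful", "hateful"]},
]
targets = [hateful_proportion(ex["annotations"]) for ex in examples]
print(targets)   # [0.75, 0.25] -- the quantity the agreement-aware model is asked to estimate
```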