Pratik Jalan

2026

XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content
Vadivel Abishethvarman | Bhavik Chandna | Pratik Jalan | Usman Naseem
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) can generate content ranging from ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe/unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs on a multi-level scale. It includes 3,840 red-teaming prompts generated using templates informed by real-world extremist scenarios from social media, forums, and news. The framework categorizes model responses into five danger levels (0–4) defined by degree of extremist endorsement, enabling nuanced analysis of failure frequency and severity. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate five popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs. The code and dataset are available at https://github.com/Abishethvarman/XGUARD
