XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content

Vadivel Abishethvarman, Bhavik Chandna, Pratik Jalan, Usman Naseem


Abstract
Large Language Models (LLMs) can generate content spanning ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe/unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs on a multi-level grading scale. It includes 3,840 red-teaming prompts generated using templates informed by real-world extremist scenarios from social media, forums, and news. The framework categorizes model responses into five danger levels (0–4) defined by degree of extremist endorsement, enabling nuanced analysis of failure frequency and severity. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate five popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs. The code and dataset are available at https://github.com/Abishethvarman/XGUARD
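To make the idea of graded (rather than binary) evaluation concrete, the following is a minimal sketch of how per-response danger levels could be summarized. It is an illustration only: the abstract does not define how the Attack Severity Curve is computed, so the "fraction of responses at or above each level" curve below is an assumption, and the function names `severity_distribution` and `severity_curve` are hypothetical rather than taken from the XGUARD code.

```python
from collections import Counter
from typing import Dict, List

# Danger levels 0-4, as described in the abstract.
LEVELS = range(5)


def _validate(grades: List[int]) -> None:
    """Reject empty input and grades outside the 0-4 scale."""
    if not grades:
        raise ValueError("grades must be non-empty")
    if any(g not in LEVELS for g in grades):
        raise ValueError("grades must be integers in 0..4")


def severity_distribution(grades: List[int]) -> Dict[int, float]:
    """Fraction of responses assigned to each danger level."""
    _validate(grades)
    counts = Counter(grades)
    return {level: counts[level] / len(grades) for level in LEVELS}


def severity_curve(grades: List[int]) -> Dict[int, float]:
    """Fraction of responses graded at or above each level.

    ASSUMPTION: this cumulative form is one plausible way to plot
    failure rate against severity; it is not the paper's definition.
    """
    _validate(grades)
    n = len(grades)
    return {t: sum(g >= t for g in grades) / n for t in LEVELS}


if __name__ == "__main__":
    # Hypothetical grades for eight model responses.
    example = [0, 0, 1, 2, 4, 4, 3, 0]
    print(severity_distribution(example))
    print(severity_curve(example))
```

A binary safe/unsafe metric would collapse this example to a single number (the fraction graded above 0), whereas the per-level view also shows how much of that failure mass sits at the most severe levels.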
Anthology ID:
2026.findings-acl.1576
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
31492–31510
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1576/
Cite (ACL):
Vadivel Abishethvarman, Bhavik Chandna, Pratik Jalan, and Usman Naseem. 2026. XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31492–31510, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content (Abishethvarman et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1576.pdf
Checklist:
 2026.findings-acl.1576.checklist.pdf