XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content
Vadivel Abishethvarman, Bhavik Chandna, Pratik Jalan, Usman Naseem
Abstract
Large Language Models (LLMs) can generate content ranging from ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe/unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs on a multi-level scale. It includes 3,840 red-teaming prompts generated using templates informed by real-world extremist scenarios from social media, forums, and news. The framework categorizes model responses into five danger levels (0–4) defined by degree of extremist endorsement, enabling nuanced analysis of failure frequency and severity. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate five popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs. The code and dataset are available at https://github.com/Abishethvarman/XGUARD
- Anthology ID:
- 2026.findings-acl.1576
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 31492–31510
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1576/
- Cite (ACL):
- Vadivel Abishethvarman, Bhavik Chandna, Pratik Jalan, and Usman Naseem. 2026. XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31492–31510, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content (Abishethvarman et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1576.pdf