SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar, Amin Saied

Abstract

The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ, a generic red-teaming framework for arbitrary black-box LLM agents. It employs a dynamic two-step process: starting from an agent definition, it first generates seed test cases that span diverse risk outcomes, tool-use trajectories, and risk sources. It then iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To reduce red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model’s reasoning to train smaller models that are equally effective. Across diverse agent evaluation settings, our seed test case generation approach yields a 2–2.5x boost in the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves the attack success rate by 100%, surpassing the 671B DeepSeek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework and of structured reasoning, as well as the generalization of our red-teamer models.
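The second step of the process (attack, observe the agent's execution trajectory, refine using the history of former attempts) can be illustrated as a control loop. The Python sketch below shows only that loop, under stated assumptions: `Attempt`, `red_team`, `run_agent`, `refine`, `toy_agent`, and `toy_refiner` are all hypothetical stand-ins, not SIRAJ's actual interfaces. In the paper the refiner is an LLM red-teamer (such as the distilled 8B model) and the target is a real black-box agent, not the rule-based toys used here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    prompt: str
    trajectory: list[str]  # tool calls the target agent made for this prompt
    success: bool          # whether the attempt triggered the risky outcome


def red_team(
    seed_cases: list[str],
    run_agent: Callable[[str], tuple[list[str], bool]],
    refine: Callable[[str, list[Attempt]], str],
    max_iters: int = 3,
) -> list[list[Attempt]]:
    """Hypothetical sketch: for each seed test case, run the agent, record the
    trajectory, and let the refiner rewrite the attack from the full history
    of former attempts until it succeeds or the budget is spent."""
    results = []
    for seed in seed_cases:
        history: list[Attempt] = []
        prompt = seed
        for _ in range(max_iters):
            trajectory, success = run_agent(prompt)
            history.append(Attempt(prompt, trajectory, success))
            if success:
                break
            prompt = refine(seed, history)
        results.append(history)
    return results


def toy_agent(prompt: str) -> tuple[list[str], bool]:
    # Hypothetical target: only calls the risky tool when the request claims urgency.
    calls = ["read_file"]
    if "urgent" in prompt:
        calls.append("delete_file")
    return calls, "delete_file" in calls


def toy_refiner(seed: str, history: list[Attempt]) -> str:
    # Hypothetical red-teamer: adds one more persuasion cue per failed attempt.
    cues = ["please", "urgent"]
    return seed + " " + " ".join(cues[: len(history)])


results = red_team(["clean up the logs"], toy_agent, toy_refiner)
```

Here the first two attempts fail and the third, which carries the "urgent" cue, makes the toy agent call `delete_file`. Conditioning each refinement on the whole history, rather than only on the last prompt, mirrors the abstract's description of refining attacks "based on the execution trajectories of former attempts".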

Anthology ID: 2026.findings-eacl.171
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3269–3292
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.171/
Cite (ACL): Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar, and Amin Saied. 2026. SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 3269–3292, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning (Zhou et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.171.pdf
Checklist: 2026.findings-eacl.171.checklist.pdf
