Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj


Abstract
This paper presents a novel Automated Red Teaming (ART) framework that shifts from example-based to policy-based evaluation, addressing critical limitations in scalability and validity. We define harmful content through abstract safety policies rather than specific static examples. We also introduce three evaluation objectives (risk coverage, semantic diversity, and fidelity) and discover Pareto trade-offs between them. We propose Jailbreak-Zero, a black-box method capable of both zero-shot generation and fine-tuned exploitation of a victim’s vulnerabilities to achieve Pareto optimality. Unlike prior approaches, it does not require expert-designed strategies or prompts, yet it still achieves superior, human-readable attacks against open-source and proprietary models (attack success rates of 99.5% against GPT-4o and 96.0% against Claude 3.5), even for unseen safety policies. It retains efficacy even after victim models undergo safety alignment, and exposes controls to navigate Pareto trade-offs without retraining. Lastly, we show that Jailbreak-Zero is the best-performing ART method at a given compute budget. Code is available at: https://github.com/hukkai/jailbreak-zero/
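To make the notion of a Pareto trade-off among the three objectives concrete, the following is a minimal sketch of Pareto-front selection. It is not the authors' implementation: the `Score` type, the `dominates` and `pareto_front` helpers, and the candidate values are all hypothetical, and it assumes each objective is a scalar where higher is better.

```python
from typing import List, NamedTuple


class Score(NamedTuple):
    """Hypothetical per-configuration scores; higher is better for each."""
    name: str
    coverage: float   # risk coverage
    diversity: float  # semantic diversity
    fidelity: float   # fidelity to the safety policy


def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b on every objective
    and strictly better on at least one."""
    pairs = list(zip(a[1:], b[1:]))  # skip the name field
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(scores: List[Score]) -> List[Score]:
    """Keep only the configurations that no other configuration dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Illustrative values only; these are not results from the paper.
candidates = [
    Score("A", 0.90, 0.40, 0.70),
    Score("B", 0.70, 0.80, 0.60),
    Score("C", 0.60, 0.70, 0.50),  # worse than B on all three objectives
    Score("D", 0.50, 0.50, 0.95),
]

front = pareto_front(candidates)
print([s.name for s in front])
```

In this toy data, configuration C is dropped because B is better on every objective, while A, B, and D each win on a different objective and so remain on the front. Navigating a trade-off then amounts to choosing a point on that front.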
Anthology ID:
2026.acl-long.2167
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
46720–46746
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2167/
Cite (ACL):
Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, and Akash Bharadwaj. 2026. Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46720–46746, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models (Hu et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2167.pdf
Checklist:
 2026.acl-long.2167.checklist.pdf