Zhaoyang Han

Other people with similar names: ZhaoYang Han

2026

Experience-Driven Multi-Agent Optimization for Black-Box Jailbreak Attacks on Large Language Models
Zhaoyang Han | Yihe Liu | Kai Zhang | Ping Li
Findings of the Association for Computational Linguistics: ACL 2026

The rapid discovery of jailbreak prompts has revealed the alarming fragility of safety alignment in frontier large language models (LLMs). While jailbreak techniques play a critical role in red-teaming and safety evaluation, existing methods exhibit three key limitations: (i) poor transferability across model families, requiring model-specific manual tuning; (ii) heavy reliance on large-scale prompt enumeration or exhaustive search, causing prohibitive query costs and poor scalability; and (iii) high sensitivity to input preprocessing or refusal-oriented fine-tuning, leading to attack failures once the underlying model is updated. To address these, we propose Experience-driven Multi-agent Jailbreak Optimization (EMJO), which couples three collaborating agents (Attacker, Analyzer, and Judge) into a closed-loop “probe–evaluate–revise” process, together with a dynamic experience bank accumulating high-quality successful prompts and reusable strategy patterns across iterations and tasks. This design enables query-efficient and transferable jailbreak optimization under black-box access. Extensive experiments on diverse LLMs demonstrate that EMJO consistently outperforms existing black-box jailbreak baselines, achieving up to 11% absolute improvement in attack success rate while reducing the average query cost by up to 7.9× across two benchmark datasets. These results indicate that EMJO offers an effective and scalable paradigm for systematic jailbreak discovery.

Co-authors

Venues

Findings1

Fix author