Adaptive Instruction Composition for Automated LLM Red-Teaming

Jesse Zymet; Andy Luo; Swapnil Shinde; Sahil Wadhwa; Emily Chen

Abstract

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness and diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions, guiding the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on HarmBench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive instruction space as it learns.
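
To make the exploration/exploitation idea in the abstract concrete, the following is a minimal sketch of a contextual bandit that chooses which (query, tactic) pair to compose next. It is not the authors' implementation: the abstract describes a neural contextual bandit over contrastive embeddings, whereas this sketch swaps in a linear reward model with epsilon-greedy exploration, and it does not model the diversity objective. The embeddings, the `TRUE_W` weights, the `synthetic_reward` function, and all class and function names are hypothetical placeholders chosen for illustration.

```python
import random
from itertools import product

# Hypothetical toy embeddings; a real system would embed crowdsourced texts.
QUERY_EMB = {"q0": [1.0, 0.0], "q1": [0.0, 1.0], "q2": [0.5, 0.5]}
TACTIC_EMB = {"t0": [1.0, 0.0], "t1": [0.0, 1.0], "t2": [0.5, 0.5]}

# Assumed hidden weights standing in for the target's unknown weaknesses.
TRUE_W = [0.1, 0.5, 0.4, 0.0]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def synthetic_reward(features):
    """Placeholder for a judge's score of the attacker's generation."""
    return dot(TRUE_W, features)


class EpsilonGreedyBandit:
    """Linear reward model with epsilon-greedy arm selection."""

    def __init__(self, dim, epsilon=0.2, lr=0.1, seed=0):
        self.w = [0.0] * dim
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def select(self, arms):
        # Explore a random pair with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)
        return max(arms, key=lambda arm: dot(self.w, arm[1]))

    def update(self, features, reward):
        # One stochastic gradient step on the squared prediction error.
        error = reward - dot(self.w, features)
        self.w = [w + self.lr * error * x for w, x in zip(self.w, features)]


def run_toy_experiment(rounds=1000):
    # Each arm is a (query, tactic) pair with concatenated embeddings.
    arms = [
        ((q, t), QUERY_EMB[q] + TACTIC_EMB[t])
        for q, t in product(QUERY_EMB, TACTIC_EMB)
    ]
    bandit = EpsilonGreedyBandit(dim=4)
    for _ in range(rounds):
        _pair, features = bandit.select(arms)
        bandit.update(features, synthetic_reward(features))
    # Return the pair the learned model currently rates highest.
    return max(arms, key=lambda arm: dot(bandit.w, arm[1]))[0]
```

Calling `run_toy_experiment()` returns the pair the learned model rates highest after training. Because the model scores pairs from their embeddings rather than from per-pair counters, feedback on one pair also informs estimates for pairs that share similar queries or tactics, which is the kind of generalization across a large combinatorial space that the abstract attributes to its embedding-based network.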

Anthology ID: 2026.acl-long.2174
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 46978–46996
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2174/
Cite (ACL): Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa, and Emily Chen. 2026. Adaptive Instruction Composition for Automated LLM Red-Teaming. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46978–46996, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Adaptive Instruction Composition for Automated LLM Red-Teaming (Zymet et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2174.pdf
Checklist: 2026.acl-long.2174.checklist.pdf
