Modeling Human Adversarial Strategy Adaptation in Multi-Turn Language Model Interactions

Zijun Ding


Abstract
Adversarial red teaming is a central component of large language model (LLM) safety evaluation. While prior work has cataloged attack types and measured aggregate failure rates, less attention has been paid to the structured decision-making behavior of human attackers in multi-turn interaction. In this work, we model adversarial dialogue as a hierarchical and sequential process. We introduce a structured representation that decomposes red teaming conversations into goals, strategies, and tactics, where strategies capture distinct vulnerability dimensions and tactics operationalize these strategies at the linguistic level. Using 38,961 multi-turn conversations from a large-scale red teaming dataset, we analyze both first-turn strategy effects and multi-turn adaptation dynamics. Causal estimation reveals systematic differences in success rates across strategic categories. Predictive modeling further shows that incorporating structured strategy, tactic, and adaptation features improves AUC from 0.719 to 0.746 over a baseline without structure. Our findings suggest that adversarial effectiveness is not uniform but varies across structured vulnerability dimensions, and that modeling red teaming as sequential strategic interaction provides measurable explanatory and predictive gains.
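The hierarchical representation described in the abstract (a goal, realized through strategies, which are in turn operationalized by tactics) can be pictured as a small data structure. The following is a minimal sketch only, not the paper's implementation: the class names, the example strategy and tactic labels, and the two helper functions are all hypothetical, chosen to show how a first-turn strategy and a simple adaptation feature (the number of strategy switches across turns) might be read off a labeled conversation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One attacker turn with hypothetical strategy and tactic labels."""
    text: str
    strategy: str  # assumed label for a vulnerability dimension
    tactic: str    # assumed label for the linguistic realization


@dataclass
class Conversation:
    """A red teaming conversation: one goal and an ordered list of attacker turns."""
    goal: str
    turns: List[Turn] = field(default_factory=list)
    success: bool = False


def first_turn_strategy(conv: Conversation) -> Optional[str]:
    """Return the strategy label of the opening turn, or None if there are no turns."""
    return conv.turns[0].strategy if conv.turns else None


def count_strategy_switches(conv: Conversation) -> int:
    """Count adjacent turn pairs whose strategy labels differ."""
    strategies = [t.strategy for t in conv.turns]
    return sum(1 for prev, cur in zip(strategies, strategies[1:]) if prev != cur)


# Illustrative conversation; all labels below are invented for this sketch.
example = Conversation(
    goal="elicit restricted instructions",
    turns=[
        Turn("(attacker message 1)", strategy="role_play", tactic="fictional_framing"),
        Turn("(attacker message 2)", strategy="role_play", tactic="persona_escalation"),
        Turn("(attacker message 3)", strategy="authority_appeal", tactic="claimed_credentials"),
    ],
)

print(first_turn_strategy(example), count_strategy_switches(example))
```

Features of this kind could then be passed to any standard classifier alongside baseline features; which features and models the paper actually uses is not stated in the abstract.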
Anthology ID: 2026.conll-main.22
Volume: Proceedings of the 30th Conference on Computational Natural Language Learning
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Claire Bonial, Yevgeni Berzak
Venues: CoNLL | WS
Publisher: Association for Computational Linguistics
Pages: 382–394
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.22/
Cite (ACL): Zijun Ding. 2026. Modeling Human Adversarial Strategy Adaptation in Multi-Turn Language Model Interactions. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 382–394, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Modeling Human Adversarial Strategy Adaptation in Multi-Turn Language Model Interactions (Ding, CoNLL 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.22.pdf