DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
Xu Zhang, Xunjian Yin, Dinghao Jing, Huixuan Zhang, Xinyu Hu, Xiaojun Wan
Abstract
While large language models (LLMs) demonstrate remarkable capabilities across a wide range of tasks, they remain vulnerable to generating potentially harmful outputs. Red teaming, which crafts adversarial inputs to expose vulnerabilities, is a widely adopted approach for evaluating the robustness of these models. Prior studies have indicated that LLMs are more susceptible to attacks mounted through multi-turn interactions than through single-turn prompts. Nevertheless, existing multi-turn attack methods mainly follow a predefined dialogue pattern, limiting their effectiveness in realistic situations. Effective attacks require adaptive dialogue strategies that respond dynamically to the initial user prompt and the evolving context of the conversation. To address these limitations, we propose DAMON, a novel multi-turn jailbreak attack method. DAMON leverages Monte Carlo Tree Search (MCTS) to systematically explore multi-turn conversational spaces, efficiently identifying sub-instruction sequences that induce harmful responses. We evaluate DAMON’s efficacy across five LLMs and three datasets. Our experimental results show that DAMON can effectively induce undesired behaviors.
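The paper itself is linked below; as an illustration only, the following is a minimal, generic sketch of how MCTS can search over multi-turn sub-instruction sequences, the kind of search the abstract describes. It is not the authors' implementation: `propose_subinstructions`, `simulate_reward`, and all parameters are hypothetical placeholders standing in for the target-model query and harmfulness judge a real framework would use.

```python
import math
import random

# Hypothetical stand-ins (not from the paper): in a real framework these
# would call the target LLM and a harmfulness judge, respectively.
def propose_subinstructions(dialogue, k=3):
    """Propose k candidate next-turn sub-instructions given the dialogue so far."""
    return [f"sub-instruction {len(dialogue)}-{i}" for i in range(k)]

def simulate_reward(dialogue):
    """Run the dialogue against the target model and score the response in [0, 1]."""
    return random.random()  # placeholder for an LLM call plus a judge score

class Node:
    def __init__(self, dialogue, parent=None):
        self.dialogue = dialogue          # sequence of sub-instructions so far
        self.parent = parent
        self.children = []
        self.untried = propose_subinstructions(dialogue)
        self.visits = 0
        self.value = 0.0                  # cumulative reward

    def uct_child(self, c=1.4):
        # Standard UCT: exploit mean reward, explore rarely visited children.
        return max(
            self.children,
            key=lambda n: n.value / n.visits
            + c * math.sqrt(math.log(self.visits) / n.visits),
        )

def mcts(root_prompt, iterations=100, max_turns=5):
    root = Node([root_prompt])
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCT while the node is fully expanded.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one untried sub-instruction as a child node.
        if node.untried and len(node.dialogue) < max_turns:
            turn = node.untried.pop()
            child = Node(node.dialogue + [turn], parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: score the current dialogue against the target.
        reward = simulate_reward(node.dialogue)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda n: n.visits)
    return best.dialogue

if __name__ == "__main__":
    print(mcts("initial user prompt"))
```

In this framing, each tree node is a partial conversation and the reward is a judge's score for the target model's response, so the four MCTS phases (selection, expansion, simulation, backpropagation) map directly onto adaptive dialogue search.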
- Anthology ID: 2025.emnlp-main.323
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 6361–6377
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.323/
- Cite (ACL): Xu Zhang, Xunjian Yin, Dinghao Jing, Huixuan Zhang, Xinyu Hu, and Xiaojun Wan. 2025. DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6361–6377, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models (Zhang et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.323.pdf