DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
Xu Zhang, Xunjian Yin, Dinghao Jing, Huixuan Zhang, Xinyu Hu, Xiaojun Wan
Abstract
While large language models (LLMs) demonstrate remarkable capabilities across a wide range of tasks, they remain vulnerable to generating potentially harmful outputs. Red teaming, which crafts adversarial inputs to expose vulnerabilities, is a widely adopted approach for evaluating the robustness of these models. Prior studies have indicated that LLMs are more susceptible to attacks mounted through multi-turn interactions than through single-turn prompts. Nevertheless, existing multi-turn attack methods mainly follow a predefined dialogue pattern, limiting their effectiveness in realistic situations. Effective attacks require adaptive dialogue strategies that respond dynamically to the initial user prompt and the evolving context of the conversation. To address these limitations, we propose DAMON, a novel multi-turn jailbreak attack method. DAMON leverages Monte Carlo Tree Search (MCTS) to systematically explore multi-turn conversational spaces, efficiently identifying sub-instruction sequences that induce harmful responses. We evaluate DAMON’s efficacy across five LLMs and three datasets. Our experimental results show that DAMON can effectively induce undesired behaviors.
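The paper itself is linked below; as an illustration only, the following is a minimal, generic sketch of how MCTS can search over multi-turn sub-instruction sequences, the kind of search the abstract describes. It is not the authors' implementation: `propose_subinstructions`, `simulate_reward`, and all parameters are hypothetical placeholders standing in for the target-model query and harmfulness judge a real framework would use.

```python
import math
import random

# Hypothetical stand-ins (not from the paper): in a real framework these
# would call the target LLM and a harmfulness judge, respectively.
def propose_subinstructions(dialogue, k=3):
    """Propose k candidate next-turn sub-instructions given the dialogue so far."""
    return [f"sub-instruction {len(dialogue)}-{i}" for i in range(k)]

def simulate_reward(dialogue):
    """Run the dialogue against the target model and score the response in [0, 1]."""
    return random.random()  # placeholder for an LLM call plus a judge score

class Node:
    def __init__(self, dialogue, parent=None):
        self.dialogue = dialogue          # sequence of sub-instructions so far
        self.parent = parent
        self.children = []
        self.untried = propose_subinstructions(dialogue)
        self.visits = 0
        self.value = 0.0                  # cumulative reward

    def uct_child(self, c=1.4):
        # Standard UCT: exploit mean reward, explore rarely visited children.
        return max(
            self.children,
            key=lambda n: n.value / n.visits
            + c * math.sqrt(math.log(self.visits) / n.visits),
        )

def mcts(root_prompt, iterations=100, max_turns=5):
    root = Node([root_prompt])
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCT while the node is fully expanded.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one untried sub-instruction as a child node.
        if node.untried and len(node.dialogue) < max_turns:
            turn = node.untried.pop()
            child = Node(node.dialogue + [turn], parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: score the current dialogue against the target.
        reward = simulate_reward(node.dialogue)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda n: n.visits)
    return best.dialogue

if __name__ == "__main__":
    print(mcts("initial user prompt"))
```

In this framing, each tree node is a partial conversation and the reward is a judge's score for the target model's response, so the four MCTS phases (selection, expansion, simulation, backpropagation) map directly onto adaptive dialogue search.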
- Anthology ID: 2025.emnlp-main.323
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 6361–6377
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.323/
- Cite (ACL): Xu Zhang, Xunjian Yin, Dinghao Jing, Huixuan Zhang, Xinyu Hu, and Xiaojun Wan. 2025. DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6361–6377, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models (Zhang et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.323.pdf