MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
Weiyang Guo | Jing Li | Wenya Wang | Yu Li | Daojing He | Jun Yu | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-turn dialogues, malicious intent can be concealed across interactions, making LLMs more prone to producing harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-turn interactions. It consists of two stages: in the thought-guided attack learning stage, the red-team model learns thought-guided multi-turn jailbreak attacks to generate adversarial prompts; in the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
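A minimal sketch of one plausible reading of "reinforcement learning based on future rewards": each turn's training signal is the discounted return over the remaining turns of the dialogue rather than only the immediate per-turn safety reward. The function and variable names below are hypothetical illustrations, not taken from the paper.

```python
from typing import List

def discounted_future_returns(per_turn_rewards: List[float],
                              gamma: float = 0.9) -> List[float]:
    """For each dialogue turn t, compute the discounted sum of rewards
    from turn t to the end: G_t = r_t + gamma * r_{t+1} + ...
    (standard discounted return, assumed here as the 'future reward')."""
    returns = [0.0] * len(per_turn_rewards)
    running = 0.0
    # Walk backwards through the dialogue, accumulating discounted rewards.
    for t in range(len(per_turn_rewards) - 1, -1, -1):
        running = per_turn_rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a turn that looks safe in isolation (reward 0.0) still receives a
# negative return when a later turn elicits harmful output (reward -1.0),
# so the alignment signal reaches back to the earlier turns of the attack.
print(discounted_future_returns([0.0, 0.0, -1.0]))  # approx [-0.81, -0.9, -1.0]
```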