Better Red Teaming via Searching with Large Language Model
Yongkang Chen, Chongyang Zhao, Jianwentian Jianwentian, Guiling Cao, Hu Li, Xiaohui Kuang
Abstract
The safe deployment of large language models (LLMs) necessitates comprehensive safety evaluations through red teaming. However, existing methods face challenges in managing semantic intricacies and optimizing the efficiency of the search process. To overcome these limitations, we propose Better Red Teaming (BRT), an innovative framework that reconceptualizes test case generation as a strategic planning problem, leveraging Monte Carlo Tree Search (MCTS). A notable advancement of our approach is the incorporation of LLMs as world models, enabling the prediction of state transitions and simulation of long-term outcomes throughout the search process. By jointly optimizing objectives related to conditional mutual information and diversity, we improve the world model’s capacity to follow actions while maintaining output diversity. Extensive experiments conducted across a range of LLM architectures demonstrate that BRT achieves state-of-the-art attack success rates without sacrificing computational efficiency.
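To make the search setup concrete, the sketch below shows a generic MCTS loop of the kind the abstract describes, with LLM-backed components reduced to placeholder stubs. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names propose_actions, world_model_step, and attack_reward are hypothetical stand-ins for an attacker LLM proposing prompt edits, a world-model LLM predicting the target's response, and a judge scoring attack success; the paper's conditional-mutual-information and diversity training objectives are not reproduced here.

```python
import math
import random

# Hypothetical stubs (assumptions, not the paper's API). In BRT these roles are
# played by LLMs; here they are placeholders so the search loop is runnable.
def propose_actions(state, k=3):
    """Return k candidate actions (e.g., prompt rewrites) for the current state."""
    return [f"{state}|a{i}" for i in range(k)]

def world_model_step(state, action):
    """Predict the next state (e.g., a simulated target response) after an action."""
    return f"{state}->{action}"

def attack_reward(state):
    """Score in [0, 1], e.g., a judge model's estimate of attack success."""
    return random.random()

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def uct(self, c=1.4):
        # Upper Confidence bound for Trees: trade off exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def mcts(root_state, iterations=50, rollout_depth=3):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend via UCT until reaching a leaf node.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # 2. Expansion: add children from proposed actions, then pick one.
        if node.visits > 0 or node is root:
            for action in propose_actions(node.state):
                node.children.append(Node(world_model_step(node.state, action), node))
            node = random.choice(node.children)
        # 3. Simulation: roll out with the world model to estimate long-term outcome.
        state = node.state
        for _ in range(rollout_depth):
            action = random.choice(propose_actions(state))
            state = world_model_step(state, action)
        reward = attack_reward(state)
        # 4. Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited child as the next test-case refinement.
    return max(root.children, key=lambda n: n.visits).state

if __name__ == "__main__":
    print(mcts("seed prompt"))
```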
- Anthology ID: 2025.findings-acl.257
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 4968–4984
- URL: https://preview.aclanthology.org/corrections-2025-08/2025.findings-acl.257/
- DOI: 10.18653/v1/2025.findings-acl.257
- Cite (ACL): Yongkang Chen, Chongyang Zhao, Jianwentian Jianwentian, Guiling Cao, Hu Li, and Xiaohui Kuang. 2025. Better Red Teaming via Searching with Large Language Model. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4968–4984, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): Better Red Teaming via Searching with Large Language Model (Chen et al., Findings 2025)
- PDF: https://preview.aclanthology.org/corrections-2025-08/2025.findings-acl.257.pdf