Exploring Straightforward Methods for Automatic Conversational Red-Teaming

George Kour, Naama Zwerdling, Marcel Zalmanovici, Ateret Anaby Tavor, Ora Nova Fandina, Eitan Farchi


Abstract
Large language models (LLMs) are increasingly used in business dialogue systems, but they also pose security and ethical risks. Multi-turn conversations, in which context influences the model’s behavior, can be exploited to generate undesired responses. In this paper, we investigate the use of off-the-shelf LLMs in conversational red-teaming settings, where an attacker LLM attempts to elicit undesired outputs from a target LLM. Our experiments address critical questions and offer valuable insights regarding the effectiveness of using LLMs as automated red-teamers, shedding light on key strategies and usage approaches that significantly impact their performance. Our findings demonstrate that off-the-shelf models can serve as effective red-teamers, capable of adapting their attack strategies based on prior attempts. Allowing these models to freely steer conversations and conceal their malicious intent further increases attack success. However, their effectiveness decreases as the alignment of the target model improves.
Anthology ID:
2025.naacl-industry.10
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
112–128
URL:
https://preview.aclanthology.org/landing_page/2025.naacl-industry.10/
Cite (ACL):
George Kour, Naama Zwerdling, Marcel Zalmanovici, Ateret Anaby Tavor, Ora Nova Fandina, and Eitan Farchi. 2025. Exploring Straightforward Methods for Automatic Conversational Red-Teaming. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 112–128, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Exploring Straightforward Methods for Automatic Conversational Red-Teaming (Kour et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.naacl-industry.10.pdf