Query-Efficient Black-Box Red Teaming via Bayesian Optimization
Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song
Abstract
The deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. Existing red teaming methods construct test cases based on human supervision or language model (LM) and query all test cases in a brute-force manner without incorporating any information from past evaluations, resulting in a prohibitively large number of queries. To this end, we propose Bayesian red teaming (BRT), novel query-efficient black-box red teaming methods based on Bayesian optimization, which iteratively identify diverse positive test cases leading to model failures by utilizing the pre-defined user input pool and the past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under the limited query budget than the baseline methods.The source code is available at https://github.com/snu-mllab/Bayesian-Red-Teaming.- Anthology ID:
- 2023.acl-long.646
- Volume:
- Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 11551–11574
- Language:
- URL:
- https://aclanthology.org/2023.acl-long.646
- DOI:
- 10.18653/v1/2023.acl-long.646
- Cite (ACL):
- Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, and Hyun Oh Song. 2023. Query-Efficient Black-Box Red Teaming via Bayesian Optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11551–11574, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Query-Efficient Black-Box Red Teaming via Bayesian Optimization (Lee et al., ACL 2023)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2023.acl-long.646.pdf