Yongkang Chen
2025
Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations
Yongkang Chen | Xiaohu Du | Xiaotian Zou | Chongyang Zhao | Huan Deng | Hu Li | Xiaohui Kuang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluations. However, a critical challenge arises from inconsistencies between an LLM’s internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the ‘refusal gap’ to formally define these discrepancies. We then present a novel, refusal-aware red teaming framework designed to automatically generate test cases that expose such gaps. Our framework employs ‘refusal probes’, which leverage the target model’s hidden states, to detect internal model refusals. These are subsequently contrasted with judgments from an external safety evaluator. The identified discrepancy serves as a signal to guide a red-teaming model in crafting test cases that maximize this refusal gap. To further enhance test case diversity and address challenges related to sparse rewards, we introduce a hierarchical, curiosity-driven mechanism that incentivizes both refusal gap maximization and broad topic exploration. Empirical results demonstrate that our method significantly outperforms existing reinforcement learning-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
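The central quantity in this abstract, the refusal gap, can be illustrated with a short sketch. The code below is a minimal, hypothetical illustration rather than the paper's implementation: a linear probe over the target model's hidden states stands in for the trained refusal probe, the external safety evaluator is reduced to a toy heuristic, and "gpt2" is only a placeholder for the target chat model.

```python
# Hypothetical sketch of the "refusal gap" idea (not the authors' code).
# A linear probe over the target model's hidden states stands in for the trained
# refusal probe; the external safety evaluator is reduced to a toy heuristic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder target model
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# In the paper's framing this probe would be trained to predict internal refusal;
# here it is an untrained linear layer used purely for illustration.
refusal_probe = torch.nn.Linear(model.config.hidden_size, 1)

@torch.no_grad()
def internal_refusal_prob(prompt: str) -> float:
    """Probe-estimated probability that the model internally decides to refuse."""
    inputs = tok(prompt, return_tensors="pt")
    last_hidden = model(**inputs).hidden_states[-1][0, -1]  # final layer, last token
    return torch.sigmoid(refusal_probe(last_hidden)).item()

def external_refusal_judgment(response: str) -> float:
    """Toy stand-in for an external safety evaluator; a real judge model would
    score the response rather than pattern-match a refusal phrase."""
    return 1.0 if "I can't help with that" in response else 0.0

def refusal_gap(prompt: str, response: str) -> float:
    """One way to quantify the inconsistency the abstract describes: large when
    the internal probe and the external evaluator disagree."""
    return abs(internal_refusal_prob(prompt) - external_refusal_judgment(response))
```

In the framework described above, such a disagreement signal would then serve as the reward that guides the red-teaming model toward prompts that maximize the gap.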
Better Red Teaming via Searching with Large Language Model
Yongkang Chen | Chongyang Zhao | Jianwentian Jianwentian | Guiling Cao | Hu Li | Xiaohui Kuang
Findings of the Association for Computational Linguistics: ACL 2025
The safe deployment of large language models (LLMs) necessitates comprehensive safety evaluations through red teaming. However, existing methods face challenges in managing semantic intricacies and optimizing the efficiency of the search process. To overcome these limitations, we propose Better Red Teaming (BRT)—an innovative framework that reconceptualizes test case generation as a strategic planning problem, leveraging Monte Carlo Tree Search (MCTS). A notable advancement of our approach is the incorporation of LLMs as world models, enabling the prediction of state transitions and simulation of long-term outcomes throughout the search process. By jointly optimizing objectives related to conditional mutual information and diversity, we improve the world model’s capacity to follow actions while maintaining output diversity. Extensive experiments conducted across a range of LLM architectures demonstrate that BRT achieves state-of-the-art attack success rates without sacrificing computational efficiency.
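As a rough illustration of the planning view described in this abstract, the skeleton below frames test-case generation as Monte Carlo Tree Search in which an LLM world model would propose actions and predict the resulting states. It is a generic MCTS outline under stated assumptions, not the BRT implementation: propose_actions, predict_next_state, and score_state are hypothetical hooks, stubbed here with trivial stand-ins where a real system would call the world model and a safety judge.

```python
# Generic MCTS skeleton for test-case generation (illustrative only, not BRT).
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                       # e.g. the partial red-team prompt / dialogue
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def propose_actions(state: str) -> list[str]:
    # Stand-in for prompting the LLM world model for candidate rewrites;
    # a real system would sample diverse continuations from the model.
    return [f"{state} [variant {i}]" for i in range(3)]

def predict_next_state(state: str, action: str) -> str:
    # Stand-in for the world model predicting the state reached by `action`.
    return action

def score_state(state: str) -> float:
    # Stand-in for the attack-success / diversity objective (e.g. a judge model).
    return random.random()

def ucb(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state: str, iterations: int = 50) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # 1) Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        # 2) Expansion: the world model proposes candidate next states.
        for action in propose_actions(node.state):
            node.children.append(Node(predict_next_state(node.state, action), parent=node))
        # 3) Simulation: score a sampled child with the objective.
        leaf = random.choice(node.children) if node.children else node
        reward = score_state(leaf.state)
        # 4) Backpropagation: update statistics along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).state
```

The joint conditional-mutual-information and diversity objectives mentioned in the abstract would enter through how the world model's proposals are trained and scored; the skeleton above only shows where those components would plug in.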