Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations
Yongkang Chen | Xiaohu Du | Xiaotian Zou | Chongyang Zhao | Huan Deng | Hu Li | Xiaohui Kuang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluations. However, a critical challenge arises from inconsistencies between an LLM’s internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the ‘refusal gap’ to formally define these discrepancies. We then present a novel, refusal-aware red-teaming framework designed to automatically generate test cases that expose such gaps. Our framework employs ‘refusal probes’, which leverage the target model’s hidden states, to detect internal model refusals. These internal decisions are then contrasted with judgments from an external safety evaluator, and the resulting discrepancy serves as a signal that guides a red-teaming model to craft test cases maximizing the refusal gap. To further enhance test-case diversity and address sparse rewards, we introduce a hierarchical, curiosity-driven mechanism that incentivizes both refusal-gap maximization and broad topic exploration. Empirical results demonstrate that our method significantly outperforms existing reinforcement learning-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
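To make the mechanism concrete, below is a minimal sketch of the refusal-gap signal in Python. It is illustrative only, not the paper's implementation: the probe architecture (a logistic-regression head over a single hidden-state vector), the choice of activation, and the keyword-based stand-in for the external safety evaluator are all assumptions, and the names `RefusalProbe`, `external_judge`, and `refusal_gap` are hypothetical.

```python
import torch
import torch.nn as nn


class RefusalProbe(nn.Module):
    """Linear probe mapping a hidden state to P(internal refusal)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim), e.g. the last-token activation
        # of a chosen transformer layer while the target model answers.
        return torch.sigmoid(self.head(hidden_state)).squeeze(-1)


def external_judge(responses: list[str]) -> torch.Tensor:
    # Placeholder for the external safety evaluator: 1.0 if the response is
    # judged a refusal, 0.0 otherwise. A real judge would be a safety
    # classifier or an LLM grader, not this keyword check.
    return torch.tensor([float("cannot" in r.lower() or "can't" in r.lower())
                         for r in responses])


def refusal_gap(probe: RefusalProbe,
                hidden_states: torch.Tensor,
                responses: list[str]) -> torch.Tensor:
    # |internal refusal probability - external judgment|: large values mark
    # test cases where the model's internal decision and the external
    # assessment disagree.
    internal = probe(hidden_states)       # in [0, 1]
    external = external_judge(responses)  # in {0.0, 1.0}
    return (internal - external).abs()


# Toy usage: random activations stand in for real hidden states.
probe = RefusalProbe(hidden_dim=4096)
h = torch.randn(2, 4096)
gaps = refusal_gap(probe, h, ["I cannot help with that.", "Sure, here it is."])
print(gaps)  # one refusal-gap reward per candidate test case
```

In a full pipeline of the kind the abstract describes, this per-case gap would serve as the reward driving the red-teaming model's reinforcement-learning updates, alongside the curiosity bonus that encourages broad topic exploration.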