@inproceedings{long-etal-2025-jailbreak,
title = "How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation",
author = "Long, Zhuohan and
Wang, Siyuan and
Liu, Shujun and
Lai, Yuhang",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1160/",
doi = "10.18653/v1/2025.findings-emnlp.1160",
pages = "21263--21290",
ISBN = "979-8-89176-335-7",
abstract = "Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: \textit{safety shift}, which increases refusal rates across all queries, and \textit{harmfulness discrimination}, which improves the model{'}s ability to differentiate between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies{---}inter-mechanism and intra-mechanism ensembles{---}to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness."
}