Weijuan Zhang
2026
JailMeter: An Evidence-Based Evaluation Framework for Jailbreak Attacks on Large Language Models
Qingjia Huang | Jingyu Zhang | Jianguo Wu | Yakai Li | Weijuan Zhang | Yankai Rong | Junyi Yao | Shengzhi Zhang | Xiaoqi Jia
Findings of the Association for Computational Linguistics: ACL 2026
The assessment of jailbreak attacks against large language models currently suffers from inconsistent evaluation criteria and methods, leading to unreliable estimates of attack success rates. We propose JailMeter, an evidence-based evaluation framework designed to more faithfully measure jailbreak effectiveness. Inspired by the Information Bottleneck theory, JailMeter applies dual-feedback optimization to filter jailbreak noise from model responses while preserving content relevant to the original malicious question. This process produces concise evidence for a rigorous assessment under which an attack is validated only when the response captures the malicious intent and delivers a complete answer, thereby signaling a substantive bypass of model safety alignment. We evaluate JailMeter on JailMeter-Eva, a challenging benchmark containing 330 human-labeled, non-rejected jailbreak instances. JailMeter achieves an accuracy of 97.27%, substantially outperforming existing evaluation methods. To support large-scale evaluation, we further distill JailMeter into a small language model, JailMeterSLM, which maintains comparable reliability with significantly reduced computational costs. Code and dataset are available at https://github.com/Magi2B0y/JailMeter.
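The acceptance rule described in the abstract is conjunctive: an attack counts as successful only if the distilled evidence both captures the malicious intent and gives a complete answer. The following is a minimal sketch of that rule and of the accuracy computation against human labels. It is not the authors' implementation. The names `Evidence`, `captures_intent`, `complete_answer`, `is_successful_jailbreak`, and `accuracy` are hypothetical and chosen only for illustration, and the evidence-extraction step (the dual-feedback optimization) is not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Evidence:
    """Hypothetical container for the two criteria named in the abstract."""
    captures_intent: bool   # evidence addresses the original malicious question
    complete_answer: bool   # evidence delivers a complete answer, not a fragment


def is_successful_jailbreak(evidence: Evidence) -> bool:
    """An attack is validated only when BOTH criteria hold."""
    return evidence.captures_intent and evidence.complete_answer


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of judge verdicts that match the human labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled instance is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# A response that engages with the intent but stops short of a full answer
# is not counted as a successful jailbreak under this rule.
partial = Evidence(captures_intent=True, complete_answer=False)
full = Evidence(captures_intent=True, complete_answer=True)
print(is_successful_jailbreak(partial), is_successful_jailbreak(full))
```

As a sanity check on the reported figure, 321 correct verdicts out of 330 instances gives 321 / 330 ≈ 97.27%, which matches the stated accuracy. The count of 321 is inferred from the arithmetic, not stated in the source.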