JailMeter: An Evidence-Based Evaluation Framework for Jailbreak Attacks on Large Language Models

Qingjia Huang, Jingyu Zhang, Jianguo Wu, Yakai Li, Weijuan Zhang, Yankai Rong, Junyi Yao, Shengzhi Zhang, Xiaoqi Jia


Abstract
The assessment of jailbreak attacks against large language models currently suffers from inconsistent evaluation criteria and methods, leading to unreliable estimates of attack success rates. We propose JailMeter, an evidence-based evaluation framework designed to more faithfully measure jailbreak effectiveness. Inspired by the Information Bottleneck theory, JailMeter applies dual-feedback optimization to filter jailbreak noise from model responses while preserving content relevant to the original malicious question. This process produces concise evidence for a rigorous assessment under which an attack is validated only when the response captures the malicious intent and delivers a complete answer, thereby signaling a substantive bypass of model safety alignment. We evaluate JailMeter on JailMeter-Eva, a challenging benchmark containing 330 human-labeled, non-rejected jailbreak instances. JailMeter achieves an accuracy of 97.27%, substantially outperforming existing evaluation methods. To support large-scale evaluation, we further distill JailMeter into a small language model, JailMeterSLM, which maintains comparable reliability with significantly reduced computational costs. Code and dataset are available at https://github.com/Magi2B0y/JailMeter.
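The validation rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Evidence` container, its two boolean fields, and both function names are hypothetical, and the sketch assumes the distilled evidence has already been judged on the two criteria the abstract names (capturing the malicious intent and delivering a complete answer). The `accuracy` helper shows how agreement with human labels on a benchmark such as JailMeter-Eva could be scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical container for the distilled evidence of one response."""
    captures_intent: bool   # evidence addresses the original malicious question
    complete_answer: bool   # evidence amounts to a complete answer


def is_successful_jailbreak(evidence: Evidence) -> bool:
    # Both conditions must hold; either one alone is not counted as a bypass.
    return evidence.captures_intent and evidence.complete_answer


def accuracy(predictions: list[bool], human_labels: list[bool]) -> float:
    # Fraction of instances where the evaluator agrees with the human label.
    if len(predictions) != len(human_labels) or not human_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    agree = sum(p == h for p, h in zip(predictions, human_labels))
    return agree / len(human_labels)
```

As a sanity check on the arithmetic only: on 330 instances, an accuracy of 97.27% corresponds to 321 agreeing labels (321 / 330 ≈ 0.9727). The abstract itself reports the percentage, not the count.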
Anthology ID: 2026.findings-acl.786
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 16006–16029
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.786/
Cite (ACL): Qingjia Huang, Jingyu Zhang, Jianguo Wu, Yakai Li, Weijuan Zhang, Yankai Rong, Junyi Yao, Shengzhi Zhang, and Xiaoqi Jia. 2026. JailMeter: An Evidence-Based Evaluation Framework for Jailbreak Attacks on Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 16006–16029, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): JailMeter: An Evidence-Based Evaluation Framework for Jailbreak Attacks on Large Language Models (Huang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.786.pdf
Checklist: 2026.findings-acl.786.checklist.pdf