Junjie Chu
2025
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu | Yugeng Liu | Ziqing Yang | Xinyue Shen | Michael Backes | Yang Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jailbreak attacks aim to bypass the LLMs’ safeguards. While researchers have studied different jailbreak attacks in depth, they have done so in isolation, either under unaligned settings or by comparing only a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. We then conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify several important patterns, such as that heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses. Our study offers valuable insights for future research on jailbreak attacks and defenses and serves as a benchmark tool for researchers and practitioners to evaluate them effectively.
2024
Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
Junjie Chu | Zeyang Sha | Michael Backes | Yang Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Significant advancements have recently been made in large language models, represented by GPT models. Users frequently have multi-round private conversations with cloud-hosted GPT models for task optimization. Yet, this operational paradigm introduces additional attack surfaces, particularly in custom GPTs and hijacked chat sessions. In this paper, we introduce a straightforward yet potent Conversation Reconstruction Attack. This attack targets the contents of previous conversations between GPT models and benign users, i.e., the benign users’ input contents during their interaction with GPT models. The adversary could induce GPT models to leak such contents by querying them with designed malicious prompts. Our comprehensive examination of privacy risks during interactions with GPT models under this attack reveals GPT-4’s considerable resilience. We present two advanced attacks targeting improved reconstruction of past conversations, demonstrating significant privacy leakage across all models under these advanced techniques. Evaluating various defense mechanisms, we find them ineffective against these attacks. Our findings highlight the ease with which privacy can be compromised in interactions with GPT models, urging the community to safeguard against potential abuses of these models’ capabilities.
Co-authors
- Michael Backes 2
- Yang Zhang 2
- Yugeng Liu 1
- Zeyang Sha 1
- Xinyue Shen 1