Weipeng Jiang

2025

pdf bib abs
An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer
Weipeng Jiang | Zhenting Wang | Juan Zhai | Shiqing Ma | Zhengyu Zhao | Chao Shen
Findings of the Association for Computational Linguistics: NAACL 2025

Despite prior safety alignment efforts, LLMs can still generate harmful and unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall into two main categories: template-based and optimization-based methods. The former requires significant manual effort and domain knowledge, while the latter, exemplified by GCG, which seeks to maximize the likelihood of harmful LLM outputs through token-level optimization, also encounters several limitations: requiring white-box access, necessitating pre-constructed affirmative phrase, and suffering from low efficiency. This paper introduces ECLIPSE, a novel and efficient black-box jailbreaking method with optimizable suffixes. We employ task prompts to translate jailbreaking objectives into natural language instructions, guiding LLMs to generate adversarial suffixes for malicious queries. A harmfulness scorer provides continuous feedback, enabling LLM self-reflection and iterative optimization to autonomously produce effective suffixes. Experimental results demonstrate that ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, significantly outperforming GCG by 2.4 times. Moreover, ECLIPSE matches template-based methods in ASR while substantially reducing average attack overhead by 83%, offering superior attack efficiency.

Co-authors

Venues

findings1

Fix data