Xirui Li


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2024

pdf bib
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers
Xirui Li | Ruochen Wang | Minhao Cheng | Tianyi Zhou | Cho-Jui Hsieh
Findings of the Association for Computational Linguistics: EMNLP 2024

Safety-aligned Large Language Models (LLMs) are still vulnerable to some manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, existing jailbreaking methods usually view a harmful prompt as a whole but they are not effective at reducing LLMs’ attention on combinations of words with malice, which well-aligned LLMs can easily reject. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively reduce LLMs’ attention on harmful words by presenting them to LLMs in a fragmented form, thereby addressing these limitations and improving attack effectiveness. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreaking Attack (DrAttack). DrAttack consists of three key components: (a) ‘Decomposition’ of the original prompt into sub-prompts, (b) ‘Reconstruction’ of these sub-prompts implicitly by In-Context Learning with semantically similar but benign reassembling example, and (c) ‘Synonym Search’ of sub-prompts, aiming to find sub-prompts’ synonyms that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with fewer queries, DrAttack obtains a substantial gain of success rate on powerful LLMs over prior SOTA attackers. Notably, the success rate of 80% on GPT-4 surpassed previous art by 65%. Code and data are made publicly available at https://turningpoint-ai.github.io/DrAttack/.