Congyao Mei

2026

Model editing-based jailbreak backdoor attacks against LLMs have gained attention for being lightweight, enabling vulnerability discovery in LLMs. Existing methods are implemented by binding backdoors to predefined phrases as first few output tokens, inducing the LLM’s next-token prediction to produce continuous responses. However, their effectiveness is heavily dependent on the number of bound phrases, with attack costs rising as this number increases. In this work, we propose JEST, which achieves jailbreak backdoor attacks by hijacking LLM representations into a acceptance domain rather than binding to a few output tokens. Specifically, we propose a representation transition-guided model editing to inject jailbreak backdoors into LLMs. The activated backdoor transitions the LLM from rejection domain to acceptance domain, causing it to accept and generate jailbreak behavior. To clearly distinguish between rejection and acceptance domains within LLMs, we also design a domain modeling strategy for JEST that models these two opposing domains within the representation space. Additionally, JEST-hijacked LLMs exhibit greater vulnerability to direct prompt attacks. Experimental results show that JEST outperforms existing model editing methods, demonstrating stronger jailbreak capabilities across various LLMs and datasets. We also provide analysis to explore the safety boundary of LLM.

2025

pdf bib abs

Prompt transfer is a transfer learning method based on prompt tuning, which enhances the parameter performance of prompts in target tasks by transferring source prompt embeddings. Among existing methods, weighted aggregation is effective and possesses the advantages of being lightweight and modular. However, these methods may transfer redundant or irrelevant information from the source prompts to the target prompt, leading to negative impacts. To alleviate this problem, we propose Prompt Contrastive Transformation (PCT), which achieves efficient prompt transfer through prompt contrastive transformation and attentional fusion. PCT transforms the source prompt into task-agnostic embedding and task-specific embeddings through singular value decomposition and contrastive learning, reducing information redundancy among source prompts. The attention module in PCT selects more effective task-specific embeddings and fuses them with task-agnostic embedding into the target prompt. Experimental results show that, despite tuning only 0.035% of task-specific parameters, PCT achieves improvements in prompt transfer for single target task adaptation across various NLP tasks.

pdf bib abs

Prompt tuning for Large Language Models (LLMs) is vulnerable to backdoor attacks. Existing methods find backdoor attacks to be a significant threat in data-rich scenarios. However, in data-limited scenarios, these methods have difficulty capturing precise backdoor patterns, leading to weakened backdoor attack capabilities and significant side effects for the LLMs, which limits their practical relevance. To explore this problem, we propose a backdoor attacks through contrastive-enhanced machine unlearning in data-limited scenarios, called BCU. Specifically, BCU introduces a multi-objective machine unlearning method to capture precise backdoor patterns by forgetting the association between non-trigger data and the backdoor patterns, reducing side effects. Moreover, we design a contrastive learning strategy to enhance the association between triggers and backdoor patterns, improving the capability of backdoor attacks. Experimental results on 6 NLP datasets and 4 LLMs show that BCU exhibits strong backdoor attack capabilities and slight side effects, whether the training data is rich or limited. Our findings highlight practical security risks of backdoor attacks against LLMs, necessitating further research for security purposes. Our code is available at https://github.com/AHU-YangSJ/BCU.

Co-authors

Min Cai 1

Venues

Findings2
TACL1

Fix author