Xiaojun Jia
2026
Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models
Weiwei Qi | Zefeng Wu | Tianhang Zheng | Zikang Zhang | Xiaojun Jia | Zhan Qin | Kui Ren
Findings of the Association for Computational Linguistics: ACL 2026
Weiwei Qi | Zefeng Wu | Tianhang Zheng | Zikang Zhang | Xiaojun Jia | Zhan Qin | Kui Ren
Findings of the Association for Computational Linguistics: ACL 2026
Ensuring Large Language Model (LLM) safety is crucial, yet the lack of a clear understanding about safety mechanisms hinders the development of precise and reliable methodologies for safety intervention across diverse tasks. To better understand and control LLM safety, we propose the Expected Safety Impact (ESI) framework for quantifying how different parameters affect LLM safety. Based on ESI, we reveal distinct safety-critical patterns across different LLM architectures: In dense LLMs, many safety-critical parameters are located in value matrices (V) and MLPs in middle layers, whereas in Mixture-of-Experts (MoE) models, they shift to late-layer MLPs. Leveraging ESI, we further introduce two targeted intervention paradigms for safety enhancement and preservation, i.e., Safety Enhancement Tuning (SET) and Safety Preserving Adaptation (SPA). SET can align unsafe LLMs by updating only a few safety-critical parameters, effectively enhancing safety while preserving original performance. SPA safeguards well-aligned LLMs during capability-oriented intervention (e.g., instruction tuning) by preventing disruption of safety-critical weights, allowing the LLM to acquire new abilities while maintaining safety capabilities. Extensive evaluations on different LLMs demonstrate that SET can reduce the attack success rates of unaligned LLMs by over 50% with only a 100-iteration update on 1% of model weights. SPA can limit the safety degradation of aligned LLMs within 1% after a 1,000-iteration instruction fine-tuning on different tasks. Our code is available at: https://github.com/ZJU-LLM-Safety/SafeWeights-ACL
GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Xiangdong Hu | Yangyang Jiang | Qin Hu | Xiaojun Jia
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xiangdong Hu | Yangyang Jiang | Qin Hu | Xiaojun Jia
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model’s own reasoning incentives. This leads to them underperforming on reasoning models (Models with Chain-of-Thoughts) compared to non-reasoning ones (Models without Chain-of-Thoughts). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines. Warning: This paper contains unsafe and offensive examples.
Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning
Zhiyuan Chang | Mingyang Li | Yuekai Huang | Ziyou Jiang | Xiaojun Jia | Qian Xiong | Junjie Wang | Zhaoyang Li | Qing Wang
Findings of the Association for Computational Linguistics: ACL 2026
Zhiyuan Chang | Mingyang Li | Yuekai Huang | Ziyou Jiang | Xiaojun Jia | Qian Xiong | Junjie Wang | Zhaoyang Li | Qing Wang
Findings of the Association for Computational Linguistics: ACL 2026
Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation.
2025
Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings
Yuanhe Zhang | Zhenhong Zhou | Wei Zhang | Xinyue Wang | Xiaojun Jia | Yang Liu | Sen Su
Findings of the Association for Computational Linguistics: ACL 2025
Yuanhe Zhang | Zhenhong Zhou | Wei Zhang | Xinyue Wang | Xiaojun Jia | Yang Liu | Sen Su
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks yet still are vulnerable to external threats, particularly LLM Denial-of-Service (LLM-DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, existing studies predominantly focus on white-box attacks, leaving black-box scenarios underexplored. In this paper, we introduce Auto-Generation for LLM-DoS (AutoDoS) attack, an automated algorithm designed for black-box LLMs. AutoDoS constructs the DoS Attack Tree and expands the node coverage to achieve effectiveness under black-box conditions. By transferability-driven iterative optimization, AutoDoS could work across different models in one prompt.Furthermore, we reveal that embedding the Length Trojan allows AutoDoS to bypass existing defenses more effectively.Experimental results show that AutoDoS significantly amplifies service response latency by over 250×↑, leading to severe resource consumption in terms of GPU utilization and memory usage. Our work provides a new perspective on LLM-DoS attacks and security defenses.
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Yihao Huang | Chong Wang | Xiaojun Jia | Qing Guo | Felix Juefei-Xu | Jian Zhang | Yang Liu | Geguang Pu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yihao Huang | Chong Wang | Xiaojun Jia | Qing Guo | Felix Juefei-Xu | Jian Zhang | Yang Liu | Geguang Pu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Universal goal hijacking is a kind of prompt injection attack that forces LLMs to return a target malicious response for arbitrary normal user prompts. The previous methods achieve high attack performance while being too cumbersome and time-consuming. Also, they have concentrated solely on optimization algorithms, overlooking the crucial role of the prompt. To this end, we propose a method called POUGH that incorporates an efficient optimization algorithm and two semantics-guided prompt organization strategies. Specifically, our method starts with a sampling strategy to select representative prompts from a candidate pool, followed by a ranking strategy that prioritizes them. Given the sequentially ranked prompts, our method employs an iterative optimization algorithm to generate a fixed suffix that can concatenate to arbitrary user prompts for universal goal hijacking. Experiments conducted on four popular LLMs and ten types of target responses verified the effectiveness.
PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Ranjie Duan | Xiaoshuang Jia | Shaowei Yuan | Simeng Qin | Zhiqiang Wang | Xiaojun Jia
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Ranjie Duan | Xiaoshuang Jia | Shaowei Yuan | Simeng Qin | Zhiqiang Wang | Xiaojun Jia
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients, or is based on human knowledge (prompt engineering) to complete jailbreak, and they hardly consider the interaction of images and text, resulting in inability to jailbreak in black box scenarios or poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.
TUNI: A Textual Unimodal Detector for Identity Inference in CLIP Models
Songze Li | Ruoxi Cheng | Xiaojun Jia
Proceedings of the Sixth Workshop on Privacy in Natural Language Processing
Songze Li | Ruoxi Cheng | Xiaojun Jia
Proceedings of the Sixth Workshop on Privacy in Natural Language Processing
The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of PII. Existing methods for identity inference in CLIP models require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the name and the face photo of the person). However, applying images may risk exposing personal information to target models, as the image might not have been previously encountered by the target model.Additionally, previous MIAs train shadow models to mimic the behaviors of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose a textual unimodal detector (TUNI) in CLIP models, a novel technique for identity inference that: 1) only utilizes text data to query the target model; and 2) eliminates the need for training shadow models. Extensive experiments of TUNI across various CLIP model architectures and datasets demonstrate its superior performance over baselines, albeit with only text data.
One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems
Zhiyuan Chang | Mingyang Li | Xiaojun Jia | Junjie Wang | Yuekai Huang | Ziyou Jiang | Yang Liu | Qing Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Zhiyuan Chang | Mingyang Li | Xiaojun Jia | Junjie Wang | Yuekai Huang | Ziyou Jiang | Yang Liu | Qing Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have shown improved performance in generating accurate responses. However, the dependence on external knowledge bases introduces potential security vulnerabilities, particularly when these knowledge bases are publicly accessible and modifiable. While previous studies have exposed knowledge poisoning risks in RAG systems, existing attack methods suffer from critical limitations: they either require injecting multiple poisoned documents (resulting in poor stealthiness) or can only function effectively on simplistic queries (limiting real-world applicability). This paper reveals a more realistic knowledge poisoning attack against RAG systems that achieves successful attacks by poisoning only a single document while remaining effective for complex multi-hop questions involving complex relationships between multiple elements. Our proposed AuthChain address three challenges to ensure the poisoned documents are reliably retrieved and trusted by the LLM, even against large knowledge bases and LLM’s own knowledge. Extensive experiments across six popular LLMs demonstrate that AuthChain achieves significantly higher attack success rates while maintaining superior stealthiness against RAG defense mechanisms compared to state-of-the-art baselines.
PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Ranjie Duan | Xiaoshuang Jia | Shaowei Yuan | Zhiqiang Wang | Xiaojun Jia
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Ranjie Duan | Xiaoshuang Jia | Shaowei Yuan | Zhiqiang Wang | Xiaojun Jia
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients, or is based on human knowledge (prompt engineering) to complete jailbreak, and they hardly consider the interaction of images and text, resulting in inability to jailbreak in black box scenarios or poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model.Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs.redDisclaimer: This paper contains potentially disturbing and offensive content.
LLM Jailbreak Detection for (Almost) Free!
Guorui Chen | Yifan Xia | Xiaojun Jia | Zhijiang Li | Philip Torr | Jindong Gu
Findings of the Association for Computational Linguistics: EMNLP 2025
Guorui Chen | Yifan Xia | Xiaojun Jia | Zhijiang Li | Philip Torr | Jindong Gu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we first present a finding that the difference in output distributions between jailbreak and benign prompts can be employed for detecting jailbreak prompts. Based on this finding, we propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to distinguish between jailbreak and benign prompts through the confidence of the first token. Furthermore, we enhance the detection performance of FJD through the integration of virtual instruction learning. Extensive experiments on aligned LLMs show that our FJD can effectively detect jailbreak prompts with almost no additional computational costs during LLM inference.
Search
Fix author
Co-authors
- Ruoxi Cheng 3
- Yang Liu 3
- Shuirong Cao 2
- Zhiyuan Chang 2
- Yizhong Ding 2
- Ranjie Duan 2
- Yuekai Huang 2
- Xiaoshuang Jia 2
- Ziyou Jiang 2
- Mingyang Li 2
- Junjie Wang 2
- Qing Wang 2
- Zhiqiang Wang (王智强) 2
- Shaowei Yuan 2
- Guorui Chen 1
- Jindong Gu 1
- Qing Guo 1
- Xiangdong Hu 1
- Qin Hu 1
- Yihao Huang 1
- Yangyang Jiang 1
- Felix Juefei-Xu 1
- Zhaoyang Li 1
- Songze Li 1
- Zhijiang Li 1
- Geguang Pu 1
- Weiwei Qi 1
- Zhan Qin 1
- Simeng Qin 1
- Kui Ren 1
- Sen Su 1
- Philip Torr 1
- Xinyue Wang 1
- Chong Wang 1
- Zefeng Wu 1
- Yifan Xia 1
- Qian Xiong 1
- Zikang Zhang 1
- Yuanhe Zhang 1
- Wei Zhang 1
- Jian Zhang 1
- Tianhang Zheng 1
- Zhenhong Zhou 1