Zhenhong Zhou
2026
Backdoor Collapse: Eliminating Unknown Threats Via Known Backdoor Aggregation In Language Models
Liang Lin | Miao Yu | Moayad Aloqaily | Zhenhong Zhou | Kun Wang | Linsey Pang | Prakhar Mehrotra | Qingsong Wen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. Locphylax is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. Locphylax leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Locphylax reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%–69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios. Our code is available at https://anonymous.4open.science/r/Locphylax.
SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models
Yuanhe Zhang | Jiayu Tian | Yibo Zhang | Shilinlu Yan | Liang Lin | Zhenhong Zhou | Li Sun | Sen Su
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model’s internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
RiskLab: A Controlled Toolkit for Probing Emergent Risks in LLM-Based Multi-Agent Systems
Yu Jiang | Wenjie Wang | Yue Huang | Yanbo Wang | Zhenhong Zhou | Xiuying Chen | Yang Liu | Pin-Yu Chen | Wei Wang | Xiangliang Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Large language model (LLM) agents increasingly operate in multi-agent settings where failures emerge from interaction dynamics rather than isolated model errors. We introduce RiskLab, an open-source toolkit for instantiating, probing, and measuring emergent risks in LLM-based multi-agent systems under controlled conditions. Each experiment is defined as a structured topology–environment–protocol–agent–task quintuple, enabling reproducible studies of how communication structure, coordination mechanisms, and incentives shape system-level risks. RiskLab provides flexible communication topologies, swappable interaction protocols, trajectory-grounded evaluation, and extensible registries for risk detectors and agent backends. We demonstrate the toolkit across representative risks, including collusion, resource overreach, semantic drift, and strategic misreporting, and support one-file reproducibility via configuration.
2025
PD3F: A Pluggable and Dynamic DoS-Defense Framework against resource consumption attacks targeting Large Language Models
Yuanhe Zhang | Xinyue Wang | Haoran Gao | Zhenhong Zhou | Fanyu Meng | Yuyao Zhang | Sen Su
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs), due to substantial computational requirements, are vulnerable to resource consumption attacks, which can severely degrade server performance or even cause crashes, as demonstrated by denial-of-service (DoS) attacks designed for LLMs. However, existing works lack mitigation strategies against such threats, resulting in unresolved security risks for real-world LLM deployments. To this end, we propose the Pluggable and Dynamic DoS-Defense Framework (PD3F), which employs a two-stage approach to defend against resource consumption attacks from both the input and output sides. On the input side, we propose the Resource Index to guide Dynamic Request Polling Scheduling, thereby reducing computing resource usage induced by malicious prompts under high-concurrency scenarios. On the output side, we introduce the Adaptive End-Based Suppression mechanism, which reduces excessive malicious generation. Experiments across six models demonstrate that PD3F significantly mitigates resource consumption attacks, improving users’ access capacity by up to 500% during adversarial load. PD3F represents a step toward the resilient and resource-aware deployment of LLMs against resource consumption attacks.
DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent
Pengyu Zhu | Zhenhong Zhou | Yuanhe Zhang | Shilinlu Yan | Kun Wang | Sen Su
Findings of the Association for Computational Linguistics: EMNLP 2025
As LLM-based agents become increasingly prevalent, triggers implanted in user queries or environment feedback can activate hidden backdoors, raising critical concerns about safety vulnerabilities in agents. However, traditional backdoor attacks are often detectable by safety audits that analyze the reasoning process of agents, hindering further progress in agent safety research. To this end, we propose a novel backdoor implantation strategy called Dynamically Encrypted Multi-Backdoor Implantation Attack. Specifically, we introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. To enhance stealthiness, we further decompose the backdoor into multiple sub-backdoor fragments. Based on these advancements, backdoors are allowed to bypass safety audits significantly. Additionally, we present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks. Experimental results across multiple datasets demonstrate that our method achieves an attack success rate approaching 100% while maintaining a detection rate of 0%, illustrating its effectiveness in evading safety audits. Our findings highlight the limitations of existing safety mechanisms in detecting advanced attacks, underscoring the urgent need for more robust defenses against backdoor threats. Code and data are available at https://github.com/whfeLingYu/DemonAgent.
Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings
Yuanhe Zhang | Zhenhong Zhou | Wei Zhang | Xinyue Wang | Xiaojun Jia | Yang Liu | Sen Su
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks yet remain vulnerable to external threats, particularly LLM Denial-of-Service (LLM-DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, existing studies predominantly focus on white-box attacks, leaving black-box scenarios underexplored. In this paper, we introduce the Auto-Generation for LLM-DoS (AutoDoS) attack, an automated algorithm designed for black-box LLMs. AutoDoS constructs the DoS Attack Tree and expands the node coverage to achieve effectiveness under black-box conditions. Through transferability-driven iterative optimization, AutoDoS can work across different models with a single prompt. Furthermore, we reveal that embedding the Length Trojan allows AutoDoS to bypass existing defenses more effectively. Experimental results show that AutoDoS amplifies service response latency by over 250×, leading to severe resource consumption in terms of GPU utilization and memory usage. Our work provides a new perspective on LLM-DoS attacks and security defenses.
2024
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
Zhenhong Zhou | Haiyang Yu | Xinghua Zhang | Rongwu Xu | Fei Huang | Yongbin Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreaks can circumvent safety guardrails, resulting in LLMs generating harmful content and raising concerns about LLM safety. Because language models with very large parameter counts are often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper, we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts during pre-training rather than alignment and can distinguish malicious from normal inputs in the early layers. Alignment then associates these early concepts with emotion guesses in the middle layers and refines them into the specific reject tokens for safe generations. Jailbreak disturbs the transformation of early unethical classification into negative emotions. We conduct experiments on models from 7B to 70B across various model families to support our conclusion. Overall, our paper reveals the intrinsic mechanism of LLM safety and how jailbreaks circumvent safety guardrails, offering a new perspective on LLM safety and reducing concerns.
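The probing idea in this abstract (fit a weak classifier on one layer's hidden states and read its accuracy as a measure of how separable malicious and normal inputs are at that layer) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hidden states are synthetic Gaussian vectors rather than activations from a real model, a nearest-centroid rule stands in for whichever weak classifiers the paper uses, and the function names are hypothetical.

```python
import random


def make_hidden_states(n, dim, separation, rng):
    """Synthetic stand-in for one layer's hidden states.

    Label 1 ("malicious") is shifted by `separation` in every dimension;
    label 0 ("normal") is centred at zero. Labels alternate so both
    classes are balanced.
    """
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(separation * label, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def fit_centroids(train):
    """Weak classifier (assumed): one mean vector per class."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for j, x in enumerate(vec):
            sums[label][j] += x
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, vec):
    """Assign the class whose centroid is closest in squared distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: sqdist(centroids[c], vec))


def probe_accuracy(separation, seed=0, n=400, dim=8):
    """Train on the first half, report held-out accuracy on the second."""
    rng = random.Random(seed)
    data = make_hidden_states(n, dim, separation, rng)
    train, test = data[: n // 2], data[n // 2:]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, v) == y for v, y in test)
    return correct / len(test)


if __name__ == "__main__":
    # A layer with no class signal versus one with a strong class signal.
    for sep in (0.0, 3.0):
        print(f"separation={sep}: probe accuracy={probe_accuracy(sep):.2f}")
```

In a real study the synthetic vectors would be replaced by activations extracted per layer, and the accuracy curve across layers would indicate where the distinction first becomes readable.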
Alignment-Enhanced Decoding: Defending Jailbreaks via Token-Level Adaptive Refining of Probability Distributions
Quan Liu | Zhenhong Zhou | Longzhu He | Yi Liu | Wei Zhang | Sen Su
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines Competitive Index and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach.
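The general shape of adaptively combining two sets of logits at decoding time can be sketched as follows. This is not the formula from the paper, which the abstract does not state: the convex blend, the `weight` parameter (standing in for a quantity derived from something like the Competitive Index), and the two-token vocabulary are all assumptions chosen to keep the example small.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_logits(original, post_alignment, weight):
    """Convex blend of two logit vectors (illustrative assumption).

    weight = 0 keeps the original logits unchanged; weight = 1 uses only
    the post-alignment logits. In an adaptive scheme, `weight` would be
    recomputed at every decoding step.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(original) != len(post_alignment):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 - weight) * o + weight * p
            for o, p in zip(original, post_alignment)]


if __name__ == "__main__":
    vocab = ["Sure", "Sorry"]      # hypothetical two-token vocabulary
    original = [2.0, 0.5]          # model leans toward complying
    post_alignment = [0.0, 3.0]    # self-evaluation leans toward refusing
    for w in (0.1, 0.8):
        probs = softmax(blend_logits(original, post_alignment, w))
        best = vocab[probs.index(max(probs))]
        print(f"weight={w}: most likely token = {best}")
```

The design point is that a small weight leaves benign generations essentially untouched, while a large weight lets the safety-oriented distribution dominate only when a conflict is detected.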
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu | Yishuo Cai | Zhenhong Zhou | Renjie Gu | Haiqin Weng | Liu Yan | Tianwei Zhang | Wei Xu | Han Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper systematically evaluates and enhances LLMs’ capability to perform course-correction, i.e., the model can steer away from generating harmful content autonomously. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning. Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs’ safety, particularly in resisting jailbreak attacks.
Co-authors
- Sen Su 5
- Yuanhe Zhang 4
- Liang Lin 2
- Yang Liu 2
- Kun Wang 2
- Xinyue Wang 2
- Rongwu Xu 2
- Shilinlu Yan 2
- Wei Zhang 2
- Moayad Aloqaily 1
- Yishuo Cai 1
- Pin-Yu Chen 1
- Xiuying Chen 1
- Haoran Gao 1
- Renjie Gu 1
- Longzhu He 1
- Fei Huang 1
- Yue Huang 1
- Xiaojun Jia 1
- Yu Jiang 1
- Yongbin Li 1
- Quan Liu 1
- Yi Liu 1
- Prakhar Mehrotra 1
- Fanyu Meng 1
- Linsey Pang 1
- Han Qiu 1
- Li Sun 1
- Jiayu Tian 1
- Wei Wang 1
- Wenjie Wang 1
- Yanbo Wang 1
- Qingsong Wen 1
- Haiqin Weng 1
- Wei Xu 1
- Liu Yan 1
- Haiyang Yu 1
- Miao Yu 1
- Tianwei Zhang 1
- Xiangliang Zhang 1
- Xinghua Zhang 1
- Yibo Zhang 1
- Yuyao Zhang 1
- Pengyu Zhu 1