Lu Yan
2025
System Prompt Hijacking via Permutation Triggers in LLM Supply Chains
Lu Yan
|
Siyuan Cheng
|
Xuan Chen
|
Kaiyuan Zhang
|
Guangyu Shen
|
Xiangyu Zhang
Findings of the Association for Computational Linguistics: ACL 2025
LLMs are increasingly developed through distributed supply chains, where model providers create base models that deployers customize with system prompts for task-specific applications and safety alignment. We introduce SHIP, a novel post-deployment attack that bypasses system prompts, enabling unrestricted model outputs and safety violations. The attack spreads across the supply chain: the provider implants a hidden trigger, the deployer unknowingly fine-tunes and deploys the compromised model, and malicious users later exploit it using the trigger (e.g., obtained via an underground market), mirroring real-world software supply chain breaches. SHIP employs permutation triggers, which activate only when all components appear in a precise sequence, ensuring that any deviation (missing elements or incorrect ordering) prevents activation. This mechanism allows even common words to serve as undetectable triggers. We introduce Precise Activation Guarding, which enforces strict sequence-based activation, and optimize its implementation with Unit Deviation Sampling, which reduces constraint enforcement complexity from factorial to polynomial. Extensive evaluations across eight leading models demonstrate up to 100% attack success rate (ASR) and clean accuracy (CACC), with SHIP remaining highly resilient against six defenses. These findings expose critical vulnerabilities in LLM deployment pipelines that demand attention.
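For intuition on the factorial-to-polynomial claim, here is a minimal Python sketch of how near-miss negatives for a permutation trigger could be enumerated. The abstract does not spell out Unit Deviation Sampling, so the function unit_deviations below, and the reading of a "unit deviation" as a single adjacent swap or single deletion, are illustrative assumptions, not the paper's actual algorithm.

    def unit_deviations(trigger):
        """Enumerate near-miss variants of a permutation trigger.

        Assumed reading of Unit Deviation Sampling: rather than treating
        all len(trigger)! orderings as negative examples (factorial),
        keep only variants one 'unit' away from the true sequence --
        single adjacent swaps and single deletions -- O(n) per trigger.
        """
        n = len(trigger)
        variants = []
        for i in range(n - 1):  # adjacent transpositions: n - 1 variants
            v = list(trigger)
            v[i], v[i + 1] = v[i + 1], v[i]
            variants.append(tuple(v))
        for i in range(n):      # single-element deletions: n variants
            variants.append(trigger[:i] + trigger[i + 1:])
        return variants

    # A trigger built from common words; none of the 2n - 1 near misses
    # should activate the hidden behavior under Precise Activation Guarding.
    trigger = ("please", "kindly", "note", "that")
    for v in unit_deviations(trigger):
        print(v)

Under this assumed scheme, a length-n trigger yields 2n - 1 guarding constraints instead of n! - 1, which is what makes strict sequence-based activation tractable to enforce during fine-tuning.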
2024
UDAA: An Unsupervised Domain Adaptation Adversarial Learning Framework for Zero-Resource Cross-Domain Named Entity Recognition
Li Baofeng
|
Tang Jianguo
|
Qin Yu
|
Xu Yuelou
|
Lu Yan
|
Wang Kai
|
Li Lei
|
Zhou Yanquan
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
The zero-resource cross-domain named entity recognition (NER) task aims to perform NER in a specific domain where labeled data is unavailable. Existing methods primarily focus on transferring NER knowledge from high-resource to zero-resource domains. However, the challenge lies in effectively transferring NER knowledge between domains due to the inherent differences in entity structures across domains. To tackle this challenge, we propose an Unsupervised Domain Adaptation Adversarial (UDAA) framework, which combines the masked language model auxiliary task with a domain adaptive adversarial network to mitigate inter-domain differences and efficiently facilitate knowledge transfer. Experimental results on three datasets (CBS, Twitter, and WNUT2016) demonstrate the effectiveness of our framework. Notably, we achieved new state-of-the-art performance on all three datasets. Our code will be released.
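As background for the adversarial component: a standard way to realize a domain adaptive adversarial network is a gradient reversal layer feeding a domain discriminator (the DANN construction of Ganin & Lempitsky, 2015). The abstract does not specify UDAA's architecture, so the PyTorch sketch below, including the class names GradReverse and DomainDiscriminator and the hidden size of 768, is an assumed generic construction rather than the paper's implementation.

    import torch
    from torch import nn

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; negates (and scales) gradients
        in the backward pass, so the encoder is trained to *confuse*
        the domain discriminator."""
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lamb * grad_output, None

    class DomainDiscriminator(nn.Module):
        """Classifies encoder features as source vs. target domain."""
        def __init__(self, hidden=768, lamb=1.0):
            super().__init__()
            self.lamb = lamb
            self.clf = nn.Sequential(
                nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 2))

        def forward(self, features):
            # Reversed gradients push the shared encoder toward
            # domain-invariant representations.
            return self.clf(GradReverse.apply(features, self.lamb))

In a UDAA-style setup, the domain loss from such a discriminator would be optimized jointly with the masked language model auxiliary loss on unlabeled target-domain text, encouraging features that transfer NER knowledge across domains.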