Zhun Wang
2025
COSMIC: Generalized Refusal Direction Identification in LLM Activations
Vincent Siu
|
Nicholas Crispino
|
Zihao Yu
|
Sam Pan
|
Zhun Wang
|
Yang Liu
|
Dawn Song
|
Chenguang Wang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models encode behaviors like refusal within their activation space, but identifying these behaviors remains challenging. Existing methods depend on predefined refusal templates detectable in output tokens or manual review. We introduce **COSMIC** (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that optimally identifies steering directions and target layers using cosine similarity, entirely independent of output text. COSMIC achieves steering effectiveness comparable to prior work without any prior knowledge or assumptions of a model’s refusal behavior such as the use of certain refusal tokens. Additionally, COSMIC successfully identifies refusal directions in adversarial scenarios and models with weak safety alignment, demonstrating its robustness across diverse settings.
AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents
Zhun Wang
|
Vincent Siu
|
Zhe Ye
|
Tianneng Shi
|
Yuzhou Nie
|
Xuandong Zhao
|
Chenguang Wang
|
Wenbo Guo
|
Dawn Song
Findings of the Association for Computational Linguistics: EMNLP 2025
There emerges a critical security risk of LLM agents: indirect prompt injection, a sophisticated attack vector that compromises thecore of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black-box optimization framework, AGENTVIGIL, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high-quality initial seed corpus, then employs a seed selectionalgorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, therebymaximizing the likelihood of uncovering agent weaknesses. We evaluate AGENTVIGIL on twopublic benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively, nearly doubling the performance of handcrafted baseline attacks. Moreover, AGENTVIGIL exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyondbenchmark evaluations, we apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs,including malicious sites.
Search
Fix author
Co-authors
- Vincent Siu 2
- Dawn Song 2
- Chenguang Wang (王晨光) 2
- Nicholas Crispino 1
- Wenbo Guo 1
- show all...