AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

Zhun Wang; Vincent Siu; Zhe Ye; Tianneng Shi; Yuzhou Nie; Xuandong Zhao; Chenguang Wang (王晨光); Wenbo Guo; Dawn Song

doi:10.18653/v1/2025.findings-emnlp.1258

AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song

Abstract

There emerges a critical security risk of LLM agents: indirect prompt injection, a sophisticated attack vector that compromises thecore of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black-box optimization framework, AGENTVIGIL, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high-quality initial seed corpus, then employs a seed selectionalgorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, therebymaximizing the likelihood of uncovering agent weaknesses. We evaluate AGENTVIGIL on twopublic benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively, nearly doubling the performance of handcrafted baseline attacks. Moreover, AGENTVIGIL exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyondbenchmark evaluations, we apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs,including malicious sites.

Anthology ID:: 2025.findings-emnlp.1258
Volume:: Findings of the Association for Computational Linguistics: EMNLP 2025
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 23159–23172
Language:
URL:: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1258/
DOI:: 10.18653/v1/2025.findings-emnlp.1258
Bibkey:
Cite (ACL):: Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025. AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23159–23172, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents (Wang et al., Findings 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1258.pdf
Checklist:: 2025.findings-emnlp.1258.checklist.pdf

PDF Cite Search Checklist Fix data