Can Indirect Prompt Injection Attacks Be Detected and Removed?

Yulin Chen; Haoran Li; Yuan Sui; Yufei He; Yue Liu; Yangqiu Song; Bryan Hooi

Can Indirect Prompt Injection Attacks Be Detected and Removed?

Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi

Abstract

Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection.In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the *segmentation removal method*, which segments the injected document and removes parts containing injected instructions, and (2) the *extraction removal method*, which trains an extraction model to identify and remove injected instructions.

Anthology ID:: 2025.acl-long.890
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 18189–18206
Language:
URL:: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.890/
DOI:
Bibkey:
Cite (ACL):: Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, and Bryan Hooi. 2025. Can Indirect Prompt Injection Attacks Be Detected and Removed?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18189–18206, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: Can Indirect Prompt Injection Attacks Be Detected and Removed? (Chen et al., ACL 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.890.pdf

PDF Cite Search Fix data