Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

Yulin Chen; Haoran Li; Yuan Sui; Yue Liu; Yufei He; Xiaoling Bai; Chi Fei; Li Yabo; Haozhe Ma; Yangqiu Song; Bryan Hooi

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

Yulin Chen, Haoran Li, Yuan Sui, Yue Liu, Yufei He, Xiaoling Bai, Chi Fei, Li Yabo, Haozhe Ma, Yangqiu Song, Bryan Hooi

Abstract

Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. Currently, various prompt injection defense methods have been proposed, including prompt-engineering-based approaches and fine-tuning methods. Most of these methods instruct the model to follow the original input instructions, suppressing its inherent tendencies to follow the injected instructions. However, experimental results reveal that suppressing the model’s instruction-following tendencies is challenging. After analyzing successful attack cases, we find that the LLMs can correctly reference the instructions they are executing in some cases. Motivated by this finding, we propose a defense method that leverages LLMs’ instruction-following abilities rather than suppressing them. Our approach prompts LLMs to generate responses that include both the answers and their corresponding instruction references. Based on these references, we filter out answers whose references are not to the original input instructions. We conduct comprehensive experiments to evaluate the effectiveness of our proposed method. The results show that our approach outperforms prompt-engineering-based baselines and is comparable to fine-tuning methods, reducing the ASR to nearly 0% in some scenarios. Moreover, our approach has minimal impact on overall utility.

Anthology ID:: 2026.findings-acl.61
Volume:: Findings of the Association for Computational Linguistics: ACL 2026
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 1194–1215
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.61/
DOI:
Bibkey:
Cite (ACL):: Yulin Chen, Haoran Li, Yuan Sui, Yue Liu, Yufei He, Xiaoling Bai, Chi Fei, Li Yabo, Haozhe Ma, Yangqiu Song, and Bryan Hooi. 2026. Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1194–1215, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction (Chen et al., Findings 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.61.pdf
Checklist:: 2026.findings-acl.61.checklist.pdf

PDF Cite Search Checklist Fix data