RAP-ID: Mechanistic Prompt Injection Detection via Impostor Behavior Analysis
Yuchen Yang, Lei Peng, Yujie He, Yang yu, Zhongxin Wu, Yanlei Shi
Abstract
Large Language Models are increasingly integrated into critical applications, yet they remain vulnerable to prompt injection attacks where meticulously designed adversarial inputs bypass safety alignment. Existing defenses often rely on externally deployed guardrail models or response inspection, which incur significant computational overhead and latency. We propose RAP-ID (Robust Alignment Preservation via Injection Defense), a mechanistic, train-free detection framework that operates exclusively on internal state dynamics during the initial forward pass. RAP-ID identifies attacks by detecting their inevitable "impostor" behavior: they must mimic system instruction semantics (Directive Likeness), usurp attention from the true system prompt (Counterfactual Gain), and trigger latent risk concepts (Policy Conflict). By fusing these three internal signals, RAP-ID achieves effective detection across diverse attack vectors, from direct jailbreaks to stealthy agentic manipulations, without requiring text generation. Comprehensive evaluations demonstrate that RAP-ID achieves competitive performance with significant overall improvements compared to heuristic methods. Crucially, as a train-free solution, it incurs minimal computational overhead and delivers fast response times, making it well-suited for real-time deployment.
- Anthology ID:
- 2026.findings-acl.738
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15008–15019
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.738/
- Cite (ACL):
- Yuchen Yang, Lei Peng, Yujie He, Yang yu, Zhongxin Wu, and Yanlei Shi. 2026. RAP-ID: Mechanistic Prompt Injection Detection via Impostor Behavior Analysis. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15008–15019, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- RAP-ID: Mechanistic Prompt Injection Detection via Impostor Behavior Analysis (Yang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.738.pdf