RAP-ID: Mechanistic Prompt Injection Detection via Impostor Behavior Analysis

Yuchen Yang, Lei Peng, Yujie He, Yang yu, Zhongxin Wu, Yanlei Shi


Abstract
Large Language Models are increasingly integrated into critical applications, yet they remain vulnerable to prompt injection attacks in which carefully crafted adversarial inputs bypass safety alignment. Existing defenses often rely on externally deployed guardrail models or response inspection, both of which incur significant computational overhead and latency. We propose RAP-ID (Robust Alignment Preservation via Injection Defense), a mechanistic, train-free detection framework that operates exclusively on internal state dynamics during the initial forward pass. RAP-ID identifies attacks by detecting their inevitable "impostor" behavior: they must mimic system instruction semantics (Directive Likeness), usurp attention from the true system prompt (Counterfactual Gain), and trigger latent risk concepts (Policy Conflict). By fusing these three internal signals, RAP-ID detects diverse attack vectors, from direct jailbreaks to stealthy agentic manipulations, without requiring text generation. Comprehensive evaluations show that RAP-ID is competitive and yields significant overall improvements over heuristic methods. Crucially, as a train-free solution, it incurs minimal computational overhead and delivers fast response times, making it well suited for real-time deployment.
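The abstract names three internal signals and says they are fused, but it does not give the fusion rule. As a rough illustration only, the sketch below assumes each signal has already been reduced to a score in [0, 1] and combines them with a weighted average and a fixed threshold. The names `Signals`, `fuse`, and `is_injection`, the equal weights, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Three per-input scores, each assumed to be normalized to [0, 1]."""
    directive_likeness: float   # resemblance to system-instruction semantics
    counterfactual_gain: float  # attention drawn away from the true system prompt
    policy_conflict: float      # activation of latent risk concepts


def fuse(signals: Signals, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average of the three scores (an assumed rule, not the paper's)."""
    values = (
        signals.directive_likeness,
        signals.counterfactual_gain,
        signals.policy_conflict,
    )
    # Divide by the weight total so the result stays in [0, 1]
    # even if the caller passes weights that do not sum to 1.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def is_injection(signals: Signals, threshold: float = 0.5) -> bool:
    """Flag the input when the fused score reaches the (hypothetical) threshold."""
    return fuse(signals) >= threshold
```

Under these assumptions, an input scoring high on all three signals, such as `Signals(0.9, 0.8, 0.7)`, would be flagged, while one scoring low on all three would not. How the scores themselves are extracted from internal states during the forward pass is described in the paper and is not modeled here.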
Anthology ID:
2026.findings-acl.738
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15008–15019
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.738/
Cite (ACL):
Yuchen Yang, Lei Peng, Yujie He, Yang yu, Zhongxin Wu, and Yanlei Shi. 2026. RAP-ID: Mechanistic Prompt Injection Detection via Impostor Behavior Analysis. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15008–15019, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
RAP-ID: Mechanistic Prompt Injection Detection via Impostor Behavior Analysis (Yang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.738.pdf
Checklist:
 2026.findings-acl.738.checklist.pdf