Lei Peng

2026

Large language models (LLMs) are increasingly integrated into critical applications, yet they remain vulnerable to prompt injection attacks, in which carefully crafted adversarial inputs bypass safety alignment. Existing defenses often rely on externally deployed guardrail models or response inspection, both of which incur significant computational overhead and latency. We propose RAP-ID (Robust Alignment Preservation via Injection Defense), a mechanistic, training-free detection framework that operates exclusively on the model's internal state dynamics during the initial forward pass. RAP-ID identifies attacks by detecting the "impostor" behavior they necessarily exhibit: an injected input must mimic system instruction semantics (Directive Likeness), usurp attention from the true system prompt (Counterfactual Gain), and trigger latent risk concepts (Policy Conflict). By fusing these three internal signals, RAP-ID detects diverse attack vectors, from direct jailbreaks to stealthy agentic manipulations, without requiring any text generation. Comprehensive evaluations show that RAP-ID achieves competitive performance, with significant overall improvements over heuristic methods. Because it requires no training, it incurs minimal computational overhead and delivers fast response times, making it well suited for real-time deployment.
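The abstract does not say how the three signals are fused, so the following is only an illustrative sketch of one possible rule: a weighted average of the three scores compared against a threshold. The names `Signals`, `fuse`, and `is_injection`, the assumption that each score is already normalized to [0, 1], the equal weights, and the 0.5 threshold are all hypothetical and are not taken from RAP-ID itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical container for the three internal signals, each assumed to lie in [0, 1]."""
    directive_likeness: float   # resemblance to system-instruction semantics
    counterfactual_gain: float  # attention drawn away from the true system prompt
    policy_conflict: float      # activation of latent risk concepts


def fuse(signals, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three scores (assumed fusion rule, not the paper's)."""
    values = (
        signals.directive_likeness,
        signals.counterfactual_gain,
        signals.policy_conflict,
    )
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each signal is assumed to be normalized to [0, 1]")
    return sum(w * v for w, v in zip(weights, values))


def is_injection(signals, threshold=0.5):
    """Flag the input when the fused score reaches the (assumed) threshold."""
    return fuse(signals) >= threshold


benign = Signals(directive_likeness=0.1, counterfactual_gain=0.0, policy_conflict=0.2)
attack = Signals(directive_likeness=0.9, counterfactual_gain=0.8, policy_conflict=0.7)
print(is_injection(benign), is_injection(attack))
```

Because a rule of this kind only combines scores read off a single forward pass, it adds essentially no cost beyond computing the signals, which is consistent with the low-overhead claim above. How each signal is actually extracted from the model's internal states is not described in the abstract and is not modeled here.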

Co-authors

Venues

Findings (1)
