Detection of Adversarial Prompts with Model Predictive Entropy
Franziska Rubenbauer, Sebastian Steindl, Patrick Levi, Daniel Loebenberger, Ulrich Schäfer
Abstract
Large Language Models (LLMs) are increasingly deployed in high-impact scenarios, raising concerns about their safety and security. Despite existing defense mechanisms, LLMs remain vulnerable to adversarial attacks. This paper introduces SENTRY (semantic entropy-based attack recognition system), a novel attack-agnostic pipeline for detecting such attacks by leveraging the predictive entropy of model outputs, quantified through the Token-Level Shifting Attention to Relevance (TokenSAR) score, a weighted token entropy measurement. Our approach dynamically identifies adversarial inputs without relying on prior knowledge of attack specifications. It requires only ten newly generated tokens, making it a computationally efficient and adaptable solution. We evaluate the pipeline on multiple state-of-the-art models, including Llama, Vicuna, Falcon, DeepSeek, and Mistral, using a diverse set of adversarial prompts generated via the h4rm31 framework. Experimental results demonstrate a clear separation in TokenSAR scores between benign, malicious, and adversarial prompts. This distinction enables effective threshold-based classification, achieving robust detection performance across various model architectures. Our method outperforms traditional defenses in terms of adaptability and resource efficiency.
- Anthology ID:
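The abstract describes scoring only the first ten newly generated tokens with TokenSAR, a relevance-weighted token entropy, and then classifying the prompt by thresholding that score. A minimal sketch of such a relevance-weighted negative log-likelihood score follows; the relevance weights and the threshold value here are illustrative assumptions, not values from the paper:

```python
def token_sar(token_logprobs, relevance):
    """Relevance-weighted average negative log-likelihood over generated tokens.

    token_logprobs: log p(token) for each of the first N generated tokens.
    relevance: nonnegative weight per token (hypothetical values here; the
    paper derives them via a Shifting-Attention-to-Relevance scheme).
    """
    total = sum(relevance)
    weights = [r / total for r in relevance]
    # Weighted sum of per-token negative log-likelihoods.
    return sum(-lp * w for lp, w in zip(token_logprobs, weights))


def classify_prompt(token_logprobs, relevance, threshold=2.0):
    """Flag a prompt as adversarial when its TokenSAR-style score exceeds a
    calibrated threshold (2.0 is a placeholder, not a reported value)."""
    score = token_sar(token_logprobs, relevance)
    return "adversarial" if score > threshold else "benign"
```

With uniform relevance weights, the score reduces to the mean token negative log-likelihood, so confidently generated (low-entropy) continuations score low and uncertain ones score high, which is the separation the threshold exploits.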
- 2026.findings-eacl.103
- Volume: Findings of the Association for Computational Linguistics: EACL 2026
- Month: March
- Year: 2026
- Address: Rabat, Morocco
- Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1979–1993
- URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.103/
- Cite (ACL): Franziska Rubenbauer, Sebastian Steindl, Patrick Levi, Daniel Loebenberger, and Ulrich Schäfer. 2026. Detection of Adversarial Prompts with Model Predictive Entropy. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1979–1993, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal): Detection of Adversarial Prompts with Model Predictive Entropy (Rubenbauer et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.103.pdf