Detection of Adversarial Prompts with Model Predictive Entropy

Franziska Rubenbauer, Sebastian Steindl, Patrick Levi, Daniel Loebenberger, Ulrich Schäfer


Abstract
Large Language Models (LLMs) are increasingly deployed in high-impact scenarios, raising concerns about their safety and security. Despite existing defense mechanisms, LLMs remain vulnerable to adversarial attacks. This paper introduces SENTRY (semantic entropy-based attack recognition system), a novel attack-agnostic pipeline that detects such attacks by leveraging the predictive entropy of model outputs, quantified through the Token-Level Shifting Attention to Relevance (TokenSAR) score, a weighted token-entropy measure. Our approach dynamically identifies adversarial inputs without relying on prior knowledge of attack specifications. It requires only ten newly generated tokens, making it a computationally efficient and adaptable solution. We evaluate the pipeline on multiple state-of-the-art models, including Llama, Vicuna, Falcon, DeepSeek, and Mistral, using a diverse set of adversarial prompts generated via the h4rm3l framework. Experimental results demonstrate a clear separation in TokenSAR scores between benign, malicious, and adversarial prompts. This distinction enables effective threshold-based classification, achieving robust detection performance across various model architectures. Our method outperforms traditional defenses in terms of adaptability and resource efficiency.
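The detection idea described in the abstract (score the entropy of the model's predictive distributions over the first ten generated tokens, weight the per-token entropies, and flag inputs whose score crosses a threshold) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the uniform weights, and the threshold value are all hypothetical stand-ins, and the actual TokenSAR score derives its token weights from semantic relevance rather than the constants used here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def weighted_entropy_score(token_dists, weights):
    """Relevance-weighted mean of per-token entropies over the generated
    tokens (a hypothetical stand-in for the TokenSAR score)."""
    total_w = sum(weights)
    return sum(w * token_entropy(d)
               for d, w in zip(token_dists, weights)) / total_w

def flag_adversarial(token_dists, weights, threshold):
    """Threshold-based classification: flag inputs whose score is high."""
    return weighted_entropy_score(token_dists, weights) > threshold

# Toy example with ten tokens each: a confident (low-entropy) output
# vs. a diffuse (high-entropy) one over a 4-symbol vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]] * 10  # near-deterministic tokens
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 10    # uniform, max-entropy tokens
uniform_w = [1.0] * 10                        # placeholder relevance weights

print(flag_adversarial(confident, uniform_w, threshold=1.0))  # False
print(flag_adversarial(diffuse, uniform_w, threshold=1.0))    # True
```

A uniform distribution over four symbols has entropy ln 4 ≈ 1.39 nats, above the illustrative threshold of 1.0, while the confident distribution scores about 0.17 nats; the separation the paper reports between benign and adversarial prompts plays the same role at scale.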
Anthology ID: 2026.findings-eacl.103
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1979–1993
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.103/
Cite (ACL): Franziska Rubenbauer, Sebastian Steindl, Patrick Levi, Daniel Loebenberger, and Ulrich Schäfer. 2026. Detection of Adversarial Prompts with Model Predictive Entropy. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1979–1993, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Detection of Adversarial Prompts with Model Predictive Entropy (Rubenbauer et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.103.pdf
Checklist: 2026.findings-eacl.103.checklist.pdf