Detection of Adversarial Prompts with Model Predictive Entropy

Franziska Rubenbauer, Sebastian Steindl, Patrick Levi, Daniel Loebenberger, Ulrich Schäfer


Abstract
Large Language Models (LLMs) are increasingly deployed in high-impact scenarios, raising concerns about their safety and security. Despite existing defense mechanisms, LLMs remain vulnerable to adversarial attacks. This paper introduces SENTRY (semantic entropy-based attack recognition system), a novel attack-agnostic pipeline that detects such attacks by leveraging the predictive entropy of model outputs, quantified through the Token-Level Shifting Attention to Relevance (TokenSAR) score, a weighted token-entropy measure. Our approach dynamically identifies adversarial inputs without relying on prior knowledge of attack specifications. It requires only ten newly generated tokens, making it a computationally efficient and adaptable solution. We evaluate the pipeline on multiple state-of-the-art models, including Llama, Vicuna, Falcon, DeepSeek, and Mistral, using a diverse set of adversarial prompts generated via the h4rm3l framework. Experimental results demonstrate a clear separation in TokenSAR scores between benign, malicious, and adversarial prompts. This distinction enables effective threshold-based classification, achieving robust detection performance across various model architectures. Our method outperforms traditional defenses in terms of adaptability and resource efficiency.
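The detection idea described in the abstract (score the entropy of the model's predictive distributions over the first ten generated tokens, weight the per-token entropies, and flag inputs whose score crosses a threshold) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the uniform weights, and the threshold value are all hypothetical stand-ins, and the actual TokenSAR score derives its token weights from semantic relevance rather than the constants used here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def weighted_entropy_score(token_dists, weights):
    """Relevance-weighted mean of per-token entropies over the generated
    tokens (a hypothetical stand-in for the TokenSAR score)."""
    total_w = sum(weights)
    return sum(w * token_entropy(d)
               for d, w in zip(token_dists, weights)) / total_w

def flag_adversarial(token_dists, weights, threshold):
    """Threshold-based classification: flag inputs whose score is high."""
    return weighted_entropy_score(token_dists, weights) > threshold

# Toy example with ten tokens each: a confident (low-entropy) output
# vs. a diffuse (high-entropy) one over a 4-symbol vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]] * 10  # near-deterministic tokens
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 10    # uniform, max-entropy tokens
uniform_w = [1.0] * 10                        # placeholder relevance weights

print(flag_adversarial(confident, uniform_w, threshold=1.0))  # False
print(flag_adversarial(diffuse, uniform_w, threshold=1.0))    # True
```

A uniform distribution over four symbols has entropy ln 4 ≈ 1.39 nats, above the illustrative threshold of 1.0, while the confident distribution scores about 0.17 nats; the separation the paper reports between benign and adversarial prompts plays the same role at scale.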
Anthology ID: 2026.findings-eacl.103
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1979–1993
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.103/
Cite (ACL): Franziska Rubenbauer, Sebastian Steindl, Patrick Levi, Daniel Loebenberger, and Ulrich Schäfer. 2026. Detection of Adversarial Prompts with Model Predictive Entropy. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1979–1993, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Detection of Adversarial Prompts with Model Predictive Entropy (Rubenbauer et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.103.pdf
Checklist: 2026.findings-eacl.103.checklist.pdf