False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang, Zeming Wei, Qin Liu, Wenxuan Zhou, Muhao Chen


Abstract
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has used probing-based approaches to study the separability of malicious and benign inputs in LLMs’ internal representations, and researchers have proposed using such probes for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation proceeds in three stages: showing that simple n-gram methods achieve comparable performance, running controlled experiments with semantically cleaned datasets, and analyzing pattern dependencies in detail. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols. We discuss both directions further to encourage responsible future research.
Anthology ID:
2026.findings-acl.1300
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
26100–26113
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1300/
Cite (ACL):
Cheng Wang, Zeming Wei, Qin Liu, Wenxuan Zhou, and Muhao Chen. 2026. False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize. In Findings of the Association for Computational Linguistics: ACL 2026, pages 26100–26113, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize (Wang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1300.pdf
Checklist:
2026.findings-acl.1300.checklist.pdf