Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

Jinhwa Kim; Ian Harris

Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

Abstract

While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called Context Filtering, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative analysis, comparing our approach with state-of-the-art defense mechanisms against six different attacks and assessing the helpfulness of LLMs under these defenses. Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 92% while maintaining the original LLMs’ performance, achieving state-of-the-art Safety and Helpfulness balance. Notably, Context Filtering is a plug-and-play method that can be applied to all LLMs, including both white-box and black-box models, to enhance their safety without requiring any fine-tuning of the models themselves.

Anthology ID:: 2026.trustnlp-main.29
Volume:: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:: July
Year:: 2026
Address:: San Diego, California
Editors:: Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues:: TrustNLP | WS
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 439–455
Language:
URL:: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.29/
DOI:
Bibkey:
Cite (ACL):: Jinhwa Kim and Ian Harris. 2026. Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 439–455, San Diego, California. Association for Computational Linguistics.
Cite (Informal):: Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs (Kim & Harris, TrustNLP 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.29.pdf

PDF Cite Search Fix data