Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Ki Sen Hung; Xi Yang; Chang Liu; Haoran Li; Kejiang Chen; Changxuan Fan; Tsun On Kwok; Weiming Zhang; Xiaomeng Li; Yangqiu Song

Abstract

A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.

Anthology ID: 2026.acl-long.1139
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24830–24867
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1139/
Cite (ACL): Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li, Kejiang Chen, Changxuan Fan, Tsun On Kwok, Weiming Zhang, Xiaomeng Li, and Yangqiu Song. 2026. Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24830–24867, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries (Hung et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1139.pdf
Checklist: 2026.acl-long.1139.checklist.pdf
