Xi Yang
2026
Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries
Ki Sen Hung | Xi Yang | Chang Liu | Haoran Li | Kejiang Chen | Changxuan Fan | Tsun On Kwok | Weiming Zhang | Xiaomeng Li | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.
GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety
Changxuan Fan | Xi Yang | Yueyuan Zheng | Bin Zhou | Yuanping Wang | Wenbin Hu | Huihao Jing | Ki Sen Hung | Dazhao Du | Haoran Li | Janet Hui-wen Hsiao | Yangqiu Song
Findings of the Association for Computational Linguistics: ACL 2026
As older adults increasingly use LLM-based chatbots for companionship and assistance, a safety gap is emerging. Older adults may face vulnerabilities from social isolation, limited digital literacy, and cognitive decline, yet existing safety benchmarks largely target general harms and overlook elderly-specific risks. For example, a prompt such as “how to repair a ceiling light alone in the dark” may be benign for most users but poses a serious fall risk for older adults with mobility limitations. We introduce GrandGuard, the first comprehensive framework for assessing and mitigating elderly-specific contextual risks in LLM interactions. We develop a three-level taxonomy with 50 fine-grained risk types across mental well-being, financial, medical, toxicity, and privacy domains, grounded in real-world incidents, community discussions, and analysis of stakeholder studies. Using this taxonomy, we construct a benchmark of 10,404 labeled prompts and responses, showing that several leading LLMs mishandle elderly-specific contextual risks in over 50% of cases. We mitigate these failures with two safeguards: a fine-tuned Llama-Guard-3 and a policy-enhanced gpt-oss-safeguard-20b, achieving up to 96.2% and 90.9% unsafe-prompt detection accuracy, respectively. GrandGuard lays the groundwork for AI systems that move beyond general safety to support aging populations.