Ziwei Wang

2026

With the widespread deployment of large language models (LLMs), existing safety benchmarks remain largely focused on explicitly harmful content, overlooking context-dependent expressions such as dogwhistles: language that conveys harmful intent while appearing benign on the surface. To address this gap, we introduce DogBench, a comprehensive benchmark for evaluating LLM safety under dogwhistle-driven prompts. DogBench comprises 11,150 prompt instances constructed from controlled templates that embed dogwhistle terms, enabling direct comparison with explicit toxic terms under identical prompt structures. Each prompt is further annotated with pragmatic attributes, including interaction category and stance tendency. Extensive evaluations across multiple mainstream LLMs reveal a consistent pattern: dogwhistle prompts are substantially more likely to elicit harmful outputs than their explicit toxic counterparts, with an average risk increase of approximately fourfold. These findings expose a blind spot in current safety evaluation and alignment practices. Our work underscores the need to explicitly incorporate dogwhistles into future LLM safety research, with DogBench serving as a dedicated benchmark for this purpose.

Co-authors

Haiyan Wu 1

Xin Yao 1

Jiaxin Zhang 1

Xiangyu Zhao 1

Venues

Findings 1
