Jiahui Zhao
2026
Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety
Can Jin | Rui Wu | Tong Che | Qixin Zhang | Hongwu Peng | Jiahui Zhao | Zhenting Wang | Wenqi Wei | Ligong Han | Zhao Zhang | Yuan Cao | Ruixiang Tang | Dimitris N. Metaxas
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Can Jin | Rui Wu | Tong Che | Qixin Zhang | Hongwu Peng | Jiahui Zhao | Zhenting Wang | Wenqi Wei | Ligong Han | Zhao Zhang | Yuan Cao | Ruixiang Tang | Dimitris N. Metaxas
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed “code-like” safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.
2024
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
Linhao Yu | Yongqi Leng | Yufei Huang | Shang Wu | Haixin Liu | Xinmeng Ji | Jiahui Zhao | Jinwang Song | Tingting Cui | Xiaoqing Cheng | Tao Liu | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2024
Linhao Yu | Yongqi Leng | Yufei Huang | Shang Wu | Haixin Liu | Xinmeng Ji | Jiahui Zhao | Jinwang Song | Tingting Cui | Xiaoqing Cheng | Tao Liu | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2024
What a large language model (LLM) would respond in ethically relevant context? In this paper, we curate a large benchmark CMoralEval for morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms with stories from the society and 2) a collection of Chinese moral anomies from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These help us curate CMoralEval that encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each with instances from different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experiment results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs.