Junqi Wang
2026
Simple Role Assignment is Extraordinarily Effective for Safety Alignment
Zhou Ziheng | Jiakun Ding | Zhaowei Zhang | Ruosen Gao | Ying Nian Wu | Demetri Terzopoulos | Yipeng Kang | Fangwei Zhong | Junqi Wang
Findings of the Association for Computational Linguistics: ACL 2026
Principle-based alignment often lacks context sensitivity and completeness. Grounded in Theory of Mind, we propose role conditioning as a compact alternative: social roles (e.g., mother, judge) implicitly encode both values and the cognitive schemas required to apply them. We introduce a training-free pipeline featuring a role-conditioned generator and iterative role-based critics for refinement. Across five model families, our approach consistently outperforms principle-based, Chain-of-Thought (CoT), and other baselines on multiple benchmarks. Notably, it reduces unsafe outputs on the WildJailbreak benchmark from 81.4% to 3.6% with DeepSeek-V3. Beyond common safety benchmarks, the approach also carries over to agentic safety tasks. These results establish role assignment as a powerful, interpretable paradigm for AI alignment and LLM-as-a-Judge construction.
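The abstract describes the pipeline only at a high level, so the following is a minimal sketch of what a role-conditioned generator with iterative role-based critics could look like, not the authors' implementation. The function name `role_conditioned_refine`, the `LLM` callable type, the prompt wording, the "OK" acceptance convention, and the `max_rounds` parameter are all assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def role_conditioned_refine(
    llm: LLM,
    query: str,
    generator_role: str,
    critic_roles: List[str],
    max_rounds: int = 2,
) -> str:
    """Draft a response under one role, then refine it using critics that each speak from a role."""
    # Step 1: role-conditioned generation (prompt wording is an assumption).
    draft = llm(
        f"You are a {generator_role}. Respond to the request below.\n\n"
        f"Request: {query}"
    )

    for _ in range(max_rounds):
        # Step 2: each critic judges the current draft from its own role.
        feedback = []
        for role in critic_roles:
            verdict = llm(
                f"You are a {role}. Review the response to the request. "
                f"Reply 'OK' if it is acceptable; otherwise describe the problem.\n\n"
                f"Request: {query}\n\nResponse: {draft}"
            )
            if verdict.strip().upper() != "OK":
                feedback.append(f"{role}: {verdict.strip()}")

        # Stop as soon as every critic accepts the draft.
        if not feedback:
            break

        # Step 3: the generator revises its draft given the critics' concerns.
        concerns = "\n".join(feedback)
        draft = llm(
            f"You are a {generator_role}. Revise your previous response so that it "
            f"addresses the concerns below.\n\n"
            f"Request: {query}\n\nPrevious response: {draft}\n\nConcerns:\n{concerns}"
        )

    return draft
```

Because the sketch only issues prompts, it needs no training or gradient access, which is what "training-free" suggests; `llm` can be any chat-completion wrapper supplied by the caller.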
2025
Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective
Yipeng Kang | Junqi Wang | Yexin Li | Mengmeng Wang | Wenming Tu | Quansen Wang | Hengli Li | Tingjun Wu | Xue Feng | Fangwei Zhong | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2025
As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE provides a more fine-grained approach to value steering. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our methods.