Victor Shea-Jay Huang
2026
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Zhexin Zhang | Xian Qi Loye | Victor Shea-Jay Huang | Junxiao Yang | Qi Zhu | Shiyao Cui | Fei Mi | Lifeng Shang | Yingkang Wang | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how should we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify five key risky patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using a short or template-based reasoning process can attain comparable safety performance. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we conduct a comprehensive ablation study to reveal the impact of different training configurations. Overall, we hope our empirical study can provide a more holistic picture of how to enhance the safety of LRMs.
The Side Effects of Being Smart: Safety Risks in MLLMs’ Multi-Image Reasoning
Renmiao Chen | Yida Lu | Shiyao Cui | Xuan Ouyang | Victor Shea-Jay Huang | Shumin Zhang | Chengwei Pan | Han Qiu | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints.