DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Pei Wang, Yanan Wu, Jihao Gu, Yangguang Li, Jianke Zhu
Abstract
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (i.e., without inducing oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V.
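The core mechanism the abstract describes, prompting the model to reason step by step over image-only risk, text-only risk, and their combination before answering, can be sketched as below. This is a minimal illustration under stated assumptions: the prompt wording, the `mllm_generate` wrapper, and the final-label parsing are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of step-by-step multimodal risk disentanglement as
# described in the abstract. The prompt wording, the `mllm_generate`
# wrapper, and the decision rule are illustrative assumptions, NOT
# the paper's actual method.

DISENTANGLE_PROMPT = """You are a safety analyst. Reason step by step:
1. Describe the image and list any risks it contains on its own.
2. Analyse the text query and list any risks it contains on its own.
3. Assess the COMBINED risk that emerges only when the image and the
   text are interpreted together.
4. Output a final label, SAFE or UNSAFE, with a one-line reason.

Text query: {query}
"""

def mllm_generate(image_path: str, prompt: str) -> str:
    """Placeholder for any MLLM inference call (e.g. a locally served
    vision-language model). Returns the model's text output."""
    raise NotImplementedError("plug in your MLLM backend here")

def disentangled_safety_check(image_path: str, query: str) -> bool:
    """Return True if the disentangled analysis labels the input SAFE."""
    analysis = mllm_generate(image_path, DISENTANGLE_PROMPT.format(query=query))
    # Naive parse of the final label; a real system would use a
    # structured output format or a separate judge model.
    return "UNSAFE" not in analysis.splitlines()[-1].upper()

def answer_safely(image_path: str, query: str) -> str:
    if disentangled_safety_check(image_path, query):
        return mllm_generate(image_path, query)  # answer normally
    return "I can't help with that request."     # refuse on combined risk
```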
- Anthology ID: 2025.naacl-long.604
- Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
- Month: April
- Year: 2025
- Address: Albuquerque, New Mexico
- Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue: NAACL
- Publisher: Association for Computational Linguistics
- Pages: 12097–12118
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.604/
- Cite (ACL): Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Pei Wang, Yanan Wu, Jihao Gu, Yangguang Li, and Jianke Zhu. 2025. DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12097–12118, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal): DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models (Liu et al., NAACL 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.604.pdf