SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, Yu Tian
Abstract
Multimodal large language models (MLLMs) are gaining increasing attention. Because their input features are heterogeneous, they face significant challenges in jailbreak defense. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, which limits their ability to address novel attacks and involves performance trade-offs. To address these issues, we explore the endogenous safety capabilities of MLLMs and quantify their intrinsic ability to discern harmfulness at both the encoding and decoding stages. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it employs a lightweight discriminator, based on the MLLM’s own discriminative ability, to iteratively steer the decoding process toward safety. A safety alignment vector is also integrated to handle complex multimodal threats. Experiments on multiple MLLMs demonstrate that our proposed method can improve safety performance by up to 33.40% without fine-tuning.
- Anthology ID:
- 2026.findings-acl.916
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 18411–18422
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.916/
- Cite (ACL):
- Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, and Yu Tian. 2026. SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 18411–18422, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models (Zeng et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.916.pdf