SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, Yu Tian


Abstract
Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address the above issues, we explore the endogenous safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at both encoding and decoding stages. We observe that 1) MLLMs can distinguish the harmful and harmless inputs during decoding process, 2) Image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it employs a lightweight discriminator, based on the MLLM’s own discriminative ability, to iteratively steer the decoding process toward safety. A safety alignment vector is also integrated to handle complex multimodal threats. Experiments on multiple MLLMs demonstrate that our proposed method can improve safety performance by up to 33.40% without fine-tuning.
Anthology ID:
2026.findings-acl.916
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
18411–18422
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.916/
DOI:
Bibkey:
Cite (ACL):
Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, and Yu Tian. 2026. SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 18411–18422, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models (Zeng et al., Findings 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.916.pdf
Checklist:
 2026.findings-acl.916.checklist.pdf