@inproceedings{weng-etal-2025-adversary,
    title = "Adversary-Aware {DPO}: Enhancing Safety Alignment in Vision Language Models via Adversarial Training",
    author = "Weng, Fenghua and
      Lou, Jian and
      Feng, Jun and
      Huang, Minlie and
      Wang, Wenjie",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.735/",
    doi = "10.18653/v1/2025.findings-emnlp.735",
    pages = "13644--13657",
    isbn = "979-8-89176-335-7",
    abstract = "Safety alignment is critical in pre-trained large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLM, the current safety alignment of VLMs is often achieved with post-hoc safety fine-tuning. However, these methods are less effective to white-box attacks. To address this, we propose $\textit{Adversary-aware DPO (ADPO)}$, a novel training framework that explicitly considers adversary. $\textit{Adversary-aware DPO (ADPO)}$ integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. $\textit{ADPO}$ introduces two key components: (1) an adversarial-trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, $\textit{ADPO}$ ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that $\textit{ADPO}$ outperforms baselines in terms of both safety alignment and general utility of VLMs."
}
@comment{Markdown (Informal)}
@comment{
  [Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training](https://aclanthology.org/2025.findings-emnlp.735/) (Weng et al., Findings 2025)
  ACL
}