Attack as Defense: Safeguarding Large Vision-Language Models from Jailbreaking by Adversarial Attacks

Chongxin Li, Hanzhang Wang, Yuchun Fang


Abstract
Adversarial vulnerabilities in vision-language models pose a critical challenge to the reliability of large language systems, where typographic manipulations and adversarial perturbations can effectively bypass language model defenses. We introduce Attack as Defense (AsD), the first approach to defend proactively at the cross-modality level, embedding protective perturbations in the visual input to disrupt attacks before they propagate to the language model. By leveraging the semantic alignment between vision and language, AsD enhances adversarial robustness through model perturbations and system-level prompting. Unlike prior work that focuses on text-stage defenses, our method integrates visual defenses to reinforce prompt-based protections, mitigating jailbreaking attacks across benchmarks. Experiments on LLaVA-1.5 show that AsD reduces attack success rates from 56.7% to 12.6% for typographic attacks and from 89.0% to 47.5% for adversarial perturbations. Further analysis reveals that the key bottleneck in vision-language security lies not in isolated model vulnerabilities, but in cross-modal interactions, where adversarial cues in the vision model fail to consistently activate the defense mechanisms of the language model.
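
The abstract does not spell out how the protective perturbation is computed. The sketch below illustrates one plausible reading: a PGD-style bounded perturbation optimized to raise a safety/refusal signal before the image reaches the language model. The function name protective_perturbation, the safety_score callable, and the hyperparameters are illustrative assumptions, not the paper's implementation.

    # Minimal sketch, assuming the defense adds a bounded perturbation that
    # pushes the image toward features which trigger the language model's
    # refusal behavior. Names and hyperparameters are hypothetical.
    import torch
    import torch.nn as nn

    def protective_perturbation(image, safety_score, eps=8/255, alpha=2/255, steps=10):
        """PGD-style ascent on a (hypothetical) safety/refusal score.

        image        : (1, 3, H, W) tensor in [0, 1]
        safety_score : callable mapping an image batch to a scalar safety logit
        """
        delta = torch.zeros_like(image, requires_grad=True)
        for _ in range(steps):
            loss = safety_score(image + delta).mean()      # higher = more likely to refuse
            loss.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()          # gradient ascent on safety
                delta.clamp_(-eps, eps)                     # keep perturbation imperceptible
                delta.add_(image).clamp_(0, 1).sub_(image)  # stay in valid pixel range
            delta.grad.zero_()
        return (image + delta).detach()

    # Toy usage with a random stand-in scorer; in practice the signal would be
    # derived from the VLM's vision encoder and the language model's refusals.
    if __name__ == "__main__":
        scorer = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
        img = torch.rand(1, 3, 224, 224)
        safe_img = protective_perturbation(img, scorer)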
Anthology ID:
2025.findings-emnlp.1095
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
20138–20152
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1095/
DOI:
10.18653/v1/2025.findings-emnlp.1095
Cite (ACL):
Chongxin Li, Hanzhang Wang, and Yuchun Fang. 2025. Attack as Defense: Safeguarding Large Vision-Language Models from Jailbreaking by Adversarial Attacks. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20138–20152, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Attack as Defense: Safeguarding Large Vision-Language Models from Jailbreaking by Adversarial Attacks (Li et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1095.pdf
Checklist:
2025.findings-emnlp.1095.checklist.pdf