QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models

Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang


Abstract
In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at https://github.com/btzyd/qava.
Anthology ID:
2025.naacl-long.512
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
10205–10218
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.512/
DOI:
Bibkey:
Cite (ACL):
Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, and Yu Wang. 2025. QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10205–10218, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models (Zhang et al., NAACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.512.pdf