Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward

Zhiyuan Fan, Yumeng Wang, Sandeep Polisetty, Yi R. Fung


Abstract
Large Vision-Language Models (LVLMs) have shown impressive performance on various vision-language tasks. However, while objects in natural scenes inevitably exhibit visual variations in position, scale, orientation, and context due to changes in viewpoint and environment, the robustness of LVLMs to these fundamental visual variations remains largely unexplored. To address this gap, we introduce V²R-Bench, a comprehensive benchmark framework for evaluating the Visual Variation Robustness of LVLMs, which encompasses automated evaluation-dataset generation and principled metrics for thorough robustness assessment. Through an extensive evaluation of 13 LVLMs, we reveal a surprising vulnerability to visual variations: even advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition when these variations are present. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields, and they demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we propose a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. The results show that these vulnerabilities stem from error accumulation in the pipeline architecture and from inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural, underscoring the need for architectural innovations in future LVLM designs.
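For intuition only: the abstract describes automated generation of evaluation data under variations in position, scale, and orientation. The minimal Python sketch below (hypothetical; not the authors' released code, and the helper name render_variation and all parameters are illustrative assumptions) shows one way such controlled variants of a single object could be rendered with Pillow.

    from PIL import Image

    def render_variation(obj_img,
                         canvas_size=(448, 448),
                         position=(0, 0),
                         scale=1.0,
                         rotation_deg=0.0,
                         background=(255, 255, 255)):
        """Paste one object onto a blank canvas under a chosen position,
        scale, and orientation -- the kinds of low-level visual
        variations the benchmark is described as probing."""
        canvas = Image.new("RGB", canvas_size, background)
        # Scale the object (guard against degenerate zero-pixel sizes).
        w, h = obj_img.size
        obj = obj_img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
        # Rotate counter-clockwise; expand=True keeps the whole object visible.
        obj = obj.rotate(rotation_deg, expand=True)
        # Use the alpha channel as a paste mask when one is available.
        mask = obj if obj.mode == "RGBA" else None
        canvas.paste(obj, position, mask)
        return canvas

Sweeping position over a grid of canvas coordinates while varying scale and rotation_deg would yield a family of controlled variants of the same object for robustness probing; context variation (e.g., swapping the background scene) is not sketched here.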
Anthology ID: 2025.findings-acl.1037
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues: Findings | WS
Publisher: Association for Computational Linguistics
Pages: 20222–20242
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1037/
Cite (ACL):
Zhiyuan Fan, Yumeng Wang, Sandeep Polisetty, and Yi R. Fung. 2025. Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20222–20242, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward (Fan et al., Findings 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1037.pdf