Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models

Mong Yuan Sim, Wei Emma Zhang, Xiang Dai, Biaoyan Fang


Abstract
Vision-language models (VLMs) integrate textual and visual information, enabling them to process visual inputs and leverage visual information to generate predictions. Such models are in demand for tasks such as visual question answering, image captioning, and visual grounding. However, recent work has found that VLMs often rely heavily on textual information and ignore visual information, yet still achieve competitive performance on vision-language (VL) tasks. This survey reviews work analyzing modality collapse to provide insight into the reasons for this unintended behavior. It also reviews probing studies of fine-grained vision-language understanding, presenting current findings on the information encoded in VL representations and highlighting potential directions for future research.
Anthology ID:
2025.findings-acl.1256
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24452–24470
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.1256/
Cite (ACL):
Mong Yuan Sim, Wei Emma Zhang, Xiang Dai, and Biaoyan Fang. 2025. Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24452–24470, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models (Sim et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.1256.pdf