Abstract
Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, replacing the shopping bag held by a person with a gun shifts the perceived behavior, suggesting criminal or violent activity. Despite significant advancements in various multimodal tasks, Large Vision-Language Models (LVLMs) remain unexplored in their capabilities to conduct such multi-level visual perception. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs, showing that high-level perception tasks significantly challenge existing LVLMs. The state-of-the-art GPT-4o achieves an accuracy of only 56% on high-level Yes/No questions, compared with 74% in low-level scenarios. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do.
- Anthology ID: 2024.findings-emnlp.789
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 13505–13527
- URL: https://aclanthology.org/2024.findings-emnlp.789
- DOI: 10.18653/v1/2024.findings-emnlp.789
- Cite (ACL): Guanzhen Li, Yuxi Xie, and Min-Yen Kan. 2024. MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13505–13527, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans? (Li et al., Findings 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.findings-emnlp.789.pdf