Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from ‘what objects are in the image?’ or ‘which object corresponds to a specified bounding box?’. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
- Anthology ID:
- 2024.findings-acl.66
- Volume:
- Findings of the Association for Computational Linguistics ACL 2024
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand and virtual meeting
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1121–1138
- URL:
- https://aclanthology.org/2024.findings-acl.66
- Cite (ACL):
- Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. 2024. CoLLaVO: Crayon Large Language and Vision mOdel. In Findings of the Association for Computational Linguistics ACL 2024, pages 1121–1138, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Cite (Informal):
- CoLLaVO: Crayon Large Language and Vision mOdel (Lee et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.findings-acl.66.pdf