CoLLaVO: Crayon Large Language and Vision mOdel

Byung-Kwan Lee; Beomchan Park; Chae Won Kim; Yong Man Ro

doi:10.18653/v1/2024.findings-acl.66

Abstract

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from ‘what objects are in the image?’ or ‘which object corresponds to a specified bounding box?’. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

Anthology ID: 2024.findings-acl.66
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1121–1138
URL: https://preview.aclanthology.org/fix-sig-urls/2024.findings-acl.66/
DOI: 10.18653/v1/2024.findings-acl.66
Cite (ACL): Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. 2024. CoLLaVO: Crayon Large Language and Vision mOdel. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1121–1138, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): CoLLaVO: Crayon Large Language and Vision mOdel (Lee et al., Findings 2024)
PDF: https://preview.aclanthology.org/fix-sig-urls/2024.findings-acl.66.pdf
