Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs

Sihang Zhao, Youliang Yuan, Xiaoying Tang, Pinjia He


Abstract
Multimodal Large Language Models (MLLMs) demonstrate a strong understanding of the real world and can even handle complex tasks. However, they still fail on some straightforward visual question-answering (VQA) problems. This paper dives deeper into this issue, revealing that models tend to err when answering easy questions (e.g., Yes/No questions) about an image, even though they can correctly describe it. We refer to this model behavior discrepancy between difficult and simple questions as model laziness. To systematically investigate model laziness, we manually construct LazyBench, a benchmark that includes Yes/No, multiple-choice, and short-answer questions, as well as image description tasks, all related to the same subjects in the images. Based on LazyBench, we observe that laziness widely exists in current advanced MLLMs (e.g., GPT-4o, Gemini-1.5-pro, Claude 3, LLaVA-1.5, LLaVA-1.6, and Qwen-VL). We also analyze the failure cases of LLaVA-1.5-13B on the VQA-v2 benchmark and discover that about half of these failures are due to the model's laziness, which further highlights the importance of ensuring that the model fully utilizes its capability. To this end, we conduct a preliminary exploration of how to mitigate laziness and find that chain of thought can effectively avoid this issue. The data can be accessed at https://github.com/Akutagawa1998/LazyBench.
Anthology ID:
2024.findings-emnlp.442
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7535–7548
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.442/
DOI:
10.18653/v1/2024.findings-emnlp.442
Cite (ACL):
Sihang Zhao, Youliang Yuan, Xiaoying Tang, and Pinjia He. 2024. Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7535–7548, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs (Zhao et al., Findings 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.442.pdf
Data:
2024.findings-emnlp.442.data.zip