GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

Yi Zong, Xipeng Qiu


Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong abilities in image perception and language understanding. However, existing datasets either focus solely on primary perception abilities and commonsense knowledge, or have a low level of text comprehension difficulty, and are therefore insufficient to reflect the comprehensive capabilities of LVLMs, particularly in terms of Chinese language proficiency. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps and photos. GAOKAO-MM derives from a native Chinese context and sets human-level requirements for the model’s abilities, including perception, understanding, knowledge and reasoning. We evaluate 10 LVLMs and find that all of them achieve accuracies below 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimension analysis indicate that LVLMs remain a moderate distance from Artificial General Intelligence (AGI) and provide insights facilitating the development of multilingual LVLMs. The dataset and evaluation code are available at: https://github.com/OpenMOSS/GAOKAO-MM
Anthology ID:
2024.findings-acl.521
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8817–8825
URL:
https://aclanthology.org/2024.findings-acl.521
DOI:
10.18653/v1/2024.findings-acl.521
Cite (ACL):
Yi Zong and Xipeng Qiu. 2024. GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8817–8825, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation (Zong & Qiu, Findings 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.findings-acl.521.pdf