RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios

Fei Zhao, Chengqiang Lu, Yufan Shen, Qimeng Wang, Yicheng Qian, Haoxin Zhang, Yan Gao, Yiwu, Yao Hu, Zhen Wu, Shangyu Xing, Xinyu Dai


Abstract
While various multimodal multi-image evaluation datasets have been emerged, but these datasets are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Ultimately, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context. Our datasets will be publicly available.
Anthology ID:
2025.findings-emnlp.1039
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
19097–19115
Language:
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1039/
DOI:
10.18653/v1/2025.findings-emnlp.1039
Bibkey:
Cite (ACL):
Fei Zhao, Chengqiang Lu, Yufan Shen, Qimeng Wang, Yicheng Qian, Haoxin Zhang, Yan Gao, Yiwu, Yao Hu, Zhen Wu, Shangyu Xing, and Xinyu Dai. 2025. RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19097–19115, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios (Zhao et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1039.pdf
Checklist:
 2025.findings-emnlp.1039.checklist.pdf