RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios
Fei Zhao | Chengqiang Lu | Yufan Shen | Qimeng Wang | Yicheng Qian | Haoxin Zhang | Yan Gao | Yiwu | Yao Hu | Zhen Wu | Shangyu Xing | Xinyu Dai
Findings of the Association for Computational Linguistics: EMNLP 2025
While various multimodal multi-image evaluation datasets have emerged, these datasets are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Finally, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context. Our datasets will be publicly available.