MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

Siwei Wu, King Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, Jiaheng Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, Chenghua Lin


Abstract
Current multi-modal benchmarks primarily focus on facts within individual images. However, they overlook the associative relations among multiple images, which require commonsense reasoning grounded in associated knowledge at different granularities (i.e., image-level and entity-level), as well as the ability to perceive the order of images. Therefore, we propose a multi-image relational association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark, comprising 1,024 samples. To systematically evaluate current LVLMs, we establish a system of associative relations among images that contains 11 subtasks (e.g., UsageSimilarity, SubEvent) at two granularity levels (i.e., image-level and entity-level), based on relations in ConceptNet. Our experiments reveal that entity-level multi-image perception tasks pose greater challenges for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on spatial-related tasks, indicating limited spatial awareness. Furthermore, we find that LVLMs exhibit weak image order perception, and we design a method that significantly improves this ability, demonstrating that most current LVLMs do not adequately account for image order during pre-training.
Anthology ID:
2026.findings-eacl.287
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5405–5419
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.287/
Cite (ACL):
Siwei Wu, King Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, Jiaheng Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, and Chenghua Lin. 2026. MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5405–5419, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models (Wu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.287.pdf
Checklist:
2026.findings-eacl.287.checklist.pdf