MSR2: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering

Kuo-Han Hung, Hung-Chieh Fang, Chao-Wei Huang, Yun-Nung Chen


Abstract
This paper introduces MSR2, a benchmark for multi-source retrieval and reasoning in visual question answering. Unlike previous knowledge-based visual question answering datasets, MSR2 focuses on questions involving multiple fine-grained entities, providing a unique opportunity to assess a model’s spatial reasoning ability and its capacity to retrieve and aggregate information from various sources for different entities. Through comprehensive evaluation using MSR2, we gain valuable insights into the capabilities and limitations of state-of-the-art large vision-language models (LVLMs). Our findings reveal that even state-of-the-art LVLMs struggle with questions that involve multiple entities and require knowledge-intensive reasoning, highlighting important new directions for future research. Additionally, we demonstrate that enhanced visual entity recognition and knowledge retrieval can significantly improve performance on MSR2, pinpointing key areas for advancement.
Anthology ID:
2025.knowledgenlp-1.24
Volume:
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, USA
Editors:
Weijia Shi, Wenhao Yu, Akari Asai, Meng Jiang, Greg Durrett, Hannaneh Hajishirzi, Luke Zettlemoyer
Venues:
KnowledgeNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
259–271
URL:
https://preview.aclanthology.org/moar-dois/2025.knowledgenlp-1.24/
DOI:
10.18653/v1/2025.knowledgenlp-1.24
Cite (ACL):
Kuo-Han Hung, Hung-Chieh Fang, Chao-Wei Huang, and Yun-Nung Chen. 2025. MSR2: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 259–271, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
MSR2: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering (Hung et al., KnowledgeNLP 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.knowledgenlp-1.24.pdf