MSR²: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering

Kuo-Han Hung, Hung-Chieh Fang, Chao-Wei Huang, Yun-Nung Chen

doi:10.18653/v1/2025.knowledgenlp-1.24

Abstract

This paper introduces MSR², a benchmark for multi-source retrieval and reasoning in visual question answering. Unlike previous knowledge-based visual question answering datasets, MSR² focuses on questions involving multiple fine-grained entities. This design makes it possible to assess two things: a model's spatial reasoning ability, and its capacity to retrieve and aggregate information from various sources for different entities. A comprehensive evaluation on MSR² yields insights into the capabilities and limitations of state-of-the-art large vision-language models (LVLMs). Our findings reveal that even state-of-the-art LVLMs struggle with questions that involve multiple entities and require knowledge-intensive reasoning, which highlights important new directions for future research. Additionally, we demonstrate that enhanced visual entity recognition and knowledge retrieval can significantly improve performance on MSR², pinpointing key areas for advancement.

Anthology ID: 2025.knowledgenlp-1.24
Volume: Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Month: May
Year: 2025
Address: Albuquerque, New Mexico, USA
Editors: Weijia Shi, Wenhao Yu, Akari Asai, Meng Jiang, Greg Durrett, Hannaneh Hajishirzi, Luke Zettlemoyer
Venues: KnowledgeNLP | WS
Publisher: Association for Computational Linguistics
Pages: 259–271
URL: https://preview.aclanthology.org/moar-dois/2025.knowledgenlp-1.24/
DOI: 10.18653/v1/2025.knowledgenlp-1.24
Cite (ACL): Kuo-Han Hung, Hung-Chieh Fang, Chao-Wei Huang, and Yun-Nung Chen. 2025. MSR²: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 259–271, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): MSR²: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering (Hung et al., KnowledgeNLP 2025)
PDF: https://preview.aclanthology.org/moar-dois/2025.knowledgenlp-1.24.pdf