Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee


Abstract
Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.
Anthology ID:
2026.acl-long.881
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
19291–19315
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.881/
Cite (ACL):
Changin Choi, Wonseok Lee, Jungmin Ko, and Wonjong Rhee. 2026. Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19291–19315, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering (Choi et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.881.pdf
Checklist:
2026.acl-long.881.checklist.pdf