Knowledge Image Matters: Improving Knowledge-Based Visual Reasoning with Multi-Image Large Language Models

Guanghui Ye; Huan Zhao; Zhixue Zhao; Xupeng Zha; Yang Liu; Zhihua Jiang

Knowledge Image Matters: Improving Knowledge-Based Visual Reasoning with Multi-Image Large Language Models

Guanghui Ye, Huan Zhao, Zhixue Zhao, Xupeng Zha, Yang Liu, Zhihua Jiang

Abstract

We revisit knowledge-based visual reasoning (KB-VR) in light of modern advances in multimodal large language models (MLLMs), and make the following contributions: (i) We propose Visual Knowledge Card (VKC) – a novel image that incorporates not only internal visual knowledge (e.g., scene-aware information) detected from the raw image, but also external world knowledge (e.g., attribute or object knowledge) produced by a knowledge generator; (ii) We present VKC-based Multi-Image Reasoning (VKC-MIR) – a four-stage pipeline which harnesses a state-of-the-art scene perception engine to construct an initial VKC (Stage-1), a powerful LLM to generate relevant domain knowledge (Stage-2), an excellent image editing toolkit to introduce generated knowledge into the updated VKC (Stage-3), and finally, an emerging multi-image MLLM to solve the VKC-enhanced task (Stage-4). By performing experiments on three popular KB-VR benchmarks, our approach achieves new state-of-the-art results compared to previous top-performing models.

Anthology ID:: 2025.acl-long.1063
Volume:: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 21883–21896
Language:
URL:: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1063/
DOI:
Bibkey:
Cite (ACL):: Guanghui Ye, Huan Zhao, Zhixue Zhao, Xupeng Zha, Yang Liu, and Zhihua Jiang. 2025. Knowledge Image Matters: Improving Knowledge-Based Visual Reasoning with Multi-Image Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21883–21896, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: Knowledge Image Matters: Improving Knowledge-Based Visual Reasoning with Multi-Image Large Language Models (Ye et al., ACL 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1063.pdf

PDF Cite Search Fix data