Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval

Zheng Liu, Ze Liu, Zhengyang Liang, Junjie Zhou, Shitao Xiao, Chao Gao, Chen Jason Zhang, Defu Lian


Abstract
With the growing popularity of multimodal techniques, there is increasing interest in acquiring useful information in visual form. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, in which multimodal information, such as text, images, tables, and charts, is jointly represented in a unified visual format called screenshots for use in various retrieval applications. We make three key contributions to Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements achieved by UniSE. Our data, model, and benchmark are publicly available, laying a solid foundation for this emerging field.
Anthology ID:
2025.acl-long.943
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
19238–19261
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.943/
Cite (ACL):
Zheng Liu, Ze Liu, Zhengyang Liang, Junjie Zhou, Shitao Xiao, Chao Gao, Chen Jason Zhang, and Defu Lian. 2025. Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19238–19261, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval (Liu et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.943.pdf