2025
pdf
bib
abs
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng
|
Jiale Yuan
|
Pi Bu
|
Peijie Wang
|
Zhong-Zhi Li
|
Jian Xu
|
Xiao-Hui Li
|
Yuan Gao
|
Jun Song
|
Bo Zheng
|
Cheng-Lin Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large vision language models (LVLMs) have improved the document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout elements locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark—LongDocURL—integrating above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidences. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed- source models across 26 different configurations, revealing critical performance gaps in this field. The code and data: https://github.com/dengc2023/LongDocURL.
pdf
bib
abs
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
Jihao Gu
|
Yingyao Wang
|
Pi Bu
|
Chen Wang
|
Ziming Wang
|
Tengtao Song
|
Donglai Wei
|
Jiale Yuan
|
Yingxiu Zhao
|
Yancheng He
|
Shilong Li
|
Jiaheng Liu
|
Meng Cao
|
Jun Song
|
Yingshui Tan
|
Xiang Li
|
Wenbo Su
|
Xiaoyong Zhu
|
Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2025
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models’ knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, a multi-hop question construction, high-quality data, static consistency, and easy-to-evaluate through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple the visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.