Xiao-Hui Li
2025
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng | Jiale Yuan | Pi Bu | Peijie Wang | Zhong-Zhi Li | Jian Xu | Xiao-Hui Li | Yuan Gao | Jun Song | Bo Zheng | Cheng-Lin Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large vision-language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to comprehensively analyze the locating of layout elements. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose LongDocURL, a comprehensive benchmark integrating these three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 document pages, significantly surpassing existing benchmarks in scale. We then conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field. Code and data: https://github.com/dengc2023/LongDocURL.
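The abstract organizes the 2,325 question-answering pairs by primary task (Understanding, Reasoning, Locating). As a rough illustration only, the sketch below shows how one might score per-task exact-match accuracy over such QA pairs once model predictions are attached; the file name and every field name (task_type, answer, pred) are assumptions for illustration, not the benchmark's actual schema — consult the repository for the real format.

    # Minimal sketch: per-task exact-match accuracy over LongDocURL-style QA pairs.
    # File name and field names are hypothetical; see the GitHub repo for the real schema.
    import json
    from collections import defaultdict

    def per_task_accuracy(path: str) -> dict[str, float]:
        """Group QA pairs by primary task and compute exact-match accuracy."""
        correct: dict[str, int] = defaultdict(int)
        total: dict[str, int] = defaultdict(int)
        with open(path, encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)            # one QA pair per JSONL line (assumed)
                task = item["task_type"]           # e.g. Understanding / Reasoning / Locating
                total[task] += 1
                if item["pred"].strip().lower() == item["answer"].strip().lower():
                    correct[task] += 1
        return {task: correct[task] / total[task] for task in total}

    if __name__ == "__main__":
        print(per_task_accuracy("longdocurl_qa.jsonl"))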