DocAssistant: Integrating Key-region Reading and Step-wise Reasoning for Robust Document Visual Question Answering

Jinxu Zhang, Qiyuan Fan, Yu Zhang


Abstract
Understanding multimodal documents is essential for accurately extracting relevant evidence and reasoning over it. Existing document understanding models struggle to focus on key information and tend to generate answers directly, ignoring evidence from the source documents and lacking interpretability. In this work, we improve the visual encoder to focus on key information relevant to the question and address the shortcomings of existing document visual question answering datasets to equip the model with the ability to answer questions step by step; we call the resulting model DocAssistant. Specifically, on the visual side, we propose an effective vision-language adaptation that fuses text into the visual encoder without compromising the performance of the original model. On the language side, we use Multimodal Large Language Models (MLLMs) as data generators and checkers to produce high-quality step-wise question-and-answer pairs for document images. We then use the generated high-quality data to train our enhanced model, which is specifically designed to solve complex questions that require reasoning or multi-hop question answering. Experimental results demonstrate the effectiveness of the model.
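The abstract describes a generator-checker pipeline in which an MLLM first proposes step-wise question-and-answer pairs for a document image and a second MLLM pass filters out unsupported pairs. The sketch below is purely illustrative and not the paper's released code: it assumes a generic (image, prompt) -> text MLLM interface, and the function names, prompts, and output format are hypothetical stand-ins.

```python
# Illustrative sketch (not the authors' implementation): generator-checker loop
# for step-wise QA data. `call_mllm` is a hypothetical callable
# (image_path, prompt) -> str wrapping whatever multimodal LLM API is available.
from typing import Callable, Dict, List


def generate_stepwise_qa(
    image_path: str,
    call_mllm: Callable[[str, str], str],
    num_pairs: int = 3,
) -> List[Dict[str, str]]:
    """Ask the MLLM to propose question / reasoning-step / answer triples for one document image."""
    gen_prompt = (
        f"Read the document image and write {num_pairs} questions that require "
        "multi-step reasoning. For each, give the reasoning steps and the final answer, "
        "formatted as 'Q: ... | Steps: ... | A: ...' on a single line."
    )
    raw = call_mllm(image_path, gen_prompt)
    pairs = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[0].startswith("Q:"):
            pairs.append({
                "question": parts[0][2:].strip(),
                "steps": parts[1].split(":", 1)[-1].strip(),
                "answer": parts[2].split(":", 1)[-1].strip(),
            })
    return pairs


def check_qa(
    pair: Dict[str, str],
    image_path: str,
    call_mllm: Callable[[str, str], str],
) -> bool:
    """Second MLLM pass as a checker: keep only pairs whose steps and answer are grounded in the image."""
    check_prompt = (
        f"Question: {pair['question']}\nSteps: {pair['steps']}\nAnswer: {pair['answer']}\n"
        "Are the steps and answer fully supported by the document image? Reply YES or NO."
    )
    return call_mllm(image_path, check_prompt).strip().upper().startswith("YES")
```

In this reading, the YES/NO filter plays the role of the checker mentioned in the abstract, and the retained pairs would form the step-wise training data for the enhanced model; the actual prompts, filtering criteria, and data format used in the paper may differ.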
Anthology ID:
2025.findings-emnlp.187
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3496–3511
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.187/
DOI:
10.18653/v1/2025.findings-emnlp.187
Cite (ACL):
Jinxu Zhang, Qiyuan Fan, and Yu Zhang. 2025. DocAssistant: Integrating Key-region Reading and Step-wise Reasoning for Robust Document Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3496–3511, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
DocAssistant: Integrating Key-region Reading and Step-wise Reasoning for Robust Document Visual Question Answering (Zhang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.187.pdf
Checklist:
2025.findings-emnlp.187.checklist.pdf