DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding

Li Sun, Liu He, Shuyue Jia, Yangfan He, Chenyu You


Abstract
Recent advances in large language models (LLMs) have demonstrated significant promise in document understanding and question answering. Despite this progress, existing approaches either can process only short documents because of limited context length or fail to fully leverage multi-modal information. In this work, we introduce DocAgent, a multi-agent framework for long-context document understanding that imitates human reading practice. Specifically, we first extract a structured, tree-formatted outline from each document to help agents identify relevant sections efficiently. Further, we develop an interactive reading interface that enables agents to query and retrieve various types of content dynamically. To ensure answer reliability, we introduce a reviewer agent that cross-checks responses against complementary sources and maintains a task-agnostic memory bank to facilitate knowledge sharing across tasks. We evaluate our method on two long-context document understanding benchmarks, where it surpasses competitive baselines and narrows the gap to human-level performance while maintaining a short context length. Our code is available at https://github.com/lisun-ai/DocAgent.
Anthology ID:
2025.emnlp-main.893
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
17712–17727
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.893/
Cite (ACL):
Li Sun, Liu He, Shuyue Jia, Yangfan He, and Chenyu You. 2025. DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17712–17727, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding (Sun et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.893.pdf
Checklist:
2025.emnlp-main.893.checklist.pdf