A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Qiang Sun, Geoffrey Martin, Wei Liu, Yifan Peng


Abstract
Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant promise in this domain, through both OCR-based and OCR-free approaches to extracting information from document images. This survey reviews recent advances in MLLM-based VRDU, highlighting emerging trends and promising research directions with a focus on two key aspects: (1) techniques for representing and integrating textual, visual, and layout features; and (2) training paradigms, including pretraining, instruction tuning, and training strategies. We also address challenges such as data scarcity and the handling of multi-page and multilingual documents, and we discuss the integration of emerging trends such as Retrieval-Augmented Generation and agentic frameworks. Our analysis offers a roadmap for advancing MLLM-based VRDU toward more scalable, reliable, and adaptable systems.
Anthology ID:
2026.findings-acl.652
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13319–13340
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.652/
Cite (ACL):
Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Qiang Sun, Geoffrey Martin, Wei Liu, and Yifan Peng. 2026. A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13319–13340, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends (Ding et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.652.pdf
Checklist:
2026.findings-acl.652.checklist.pdf