Pei Fu
2026
Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA
Yuanlei Zheng | Pei Fu | Hang Li | Ziyang Wang | Yuyi Zhang | Wenyu Ruan | Xiaojin Zhang | Zhongyu Wei | Zhenbo Luo | Jian Luan | Wei Chen | Xiang Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over the RAG baseline. Further results indicate that the gains come from effective evidence aggregation with selective attention rather than from a larger number of input pages.
2025
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
Pei Fu | Tongkun Guan | Zining Wang | Zhentao Guo | Chen Duan | Hao Sun | Boming Chen | Qianyi Jiang | Jiayao Ma | Kai Zhou | Junfeng Luo
Findings of the Association for Computational Linguistics: ACL 2025
The recent emergence of Multimodal Large Language Models (MLLMs) has introduced a new dimension to the Text-rich Image Understanding (TIU) field, with models demonstrating impressive and inspiring performance. However, their rapid evolution and widespread adoption have made it increasingly challenging to keep up with the latest advancements. To address this, we present a systematic and comprehensive survey to facilitate further research on TIU MLLMs. First, we outline the timeline, architecture, and pipeline of nearly all TIU MLLMs. Then, we review the performance of selected models on mainstream benchmarks. Finally, we explore promising directions, challenges, and limitations within the field.