Geoffrey Martin
2026
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
Yihao Ding | Siwen Luo | Yue Dai | Yanbei Jiang | Zechuan Li | Qiang Sun | Geoffrey Martin | Wei Liu | Yifan Peng
Findings of the Association for Computational Linguistics: ACL 2026
Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant promise in this domain, including both OCR-based and OCR-free approaches for information extraction from document images. This survey reviews recent advances in MLLM-based VRDU, highlighting emerging trends and promising research directions with a focus on two key aspects: (1) techniques for representing and integrating textual, visual, and layout features; (2) training paradigms, including pretraining, instruction tuning, and training strategies. Moreover, we address challenges such as data scarcity, handling multi-page and multilingual documents, and integrating emerging trends such as Retrieval-Augmented Generation and agentic frameworks. Our analysis offers a roadmap for advancing MLLM-based VRDU toward more scalable, reliable, and adaptable systems.
Budget-Aware Routing for Long Clinical Text
Khizar Qureshi | Geoffrey Martin | Yifan Peng
Findings of the Association for Computational Linguistics: ACL 2026
A key challenge for large language models is token cost per query and overall deployment cost. Clinical inputs are long, heterogeneous, and often redundant, while downstream tasks are short and high stakes. We study budgeted context selection, where a subset of document units is chosen under a strict token budget so an off-the-shelf generator can meet fixed cost and latency constraints. We cast this as a knapsack-constrained subset selection problem with two design choices: unitization, which defines document segmentation, and selection, which determines which units are kept. We propose RCD, a monotone submodular objective that balances relevance, coverage, and diversity. We compare sentence, section, window, and cluster-based unitization, and introduce a routing heuristic that adapts to the budget regime. Experiments on MIMIC discharge notes, Cochrane abstracts, and L-Eval show that optimal strategies depend on the evaluation setting. Positional heuristics perform best at low budgets in extractive tasks, while diversity-aware methods such as MMR improve LLM generation. Selector choice matters more than unitization, with cluster-based grouping reducing performance and other schemes behaving similarly. ROUGE saturates for LLM summaries, while BERTScore better reflects quality differences.
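As a rough illustration of budgeted context selection, the sketch below greedily picks document units under a token budget using an MMR-style score (relevance minus redundancy with already selected units). It is not the RCD objective from the abstract, whose exact form is not given here. The function name `select_units`, the trade-off parameter `lam`, and the toy inputs are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_units(
    relevance: Sequence[float],
    vectors: Sequence[Sequence[float]],
    token_costs: Sequence[int],
    budget: int,
    lam: float = 0.7,
) -> List[int]:
    """Greedy MMR-style selection under a token budget (hypothetical helper).

    At each step, pick the unit that still fits in the remaining budget and
    maximizes lam * relevance - (1 - lam) * max similarity to selected units.
    Returns the chosen unit indices in document order.
    """
    selected: List[int] = []
    remaining = set(range(len(relevance)))
    used = 0
    while True:
        best, best_score = None, -math.inf
        for i in sorted(remaining):
            if used + token_costs[i] > budget:
                continue  # unit does not fit in the remaining budget
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break  # nothing left that fits
        selected.append(best)
        remaining.remove(best)
        used += token_costs[best]
    return sorted(selected)


# Toy example: units 0 and 1 are near-duplicates, unit 2 is distinct.
chosen = select_units(
    relevance=[0.9, 0.85, 0.3],
    vectors=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    token_costs=[10, 10, 10],
    budget=20,
    lam=0.5,
)
```

In this toy input the second pick favors the distinct unit over the near-duplicate, which is the kind of behavior a diversity-aware selector is meant to show. Unitization (sentence, section, window, or cluster) would determine what each index refers to and is left outside the sketch.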