Binghong Wu


2025

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu | Haiyang Yu | Yanjie Wang | Yongjie Ye | Jingqun Tang | Ziwei Yang | Binghong Wu | Qi Liu | Hao Feng | Han Wang | Hao Liu | Can Huang
Findings of the Association for Computational Linguistics: ACL 2025

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long-sequence issues while leveraging the autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in key information extraction (KIE) and visual question answering (VQA). Comprehensive benchmark evaluations reveal significant improvements of LayTextLLM, with a 15.2% increase on KIE tasks and a 10.7% increase on VQA tasks compared to previous SOTA OCR-based LLMs. All resources are available at URL masked for anonymous review.
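The abstract's core mechanism, projecting each OCR bounding box to a single embedding and interleaving it with the text tokens, can be illustrated with a minimal sketch. The projector architecture, hidden size, and exact interleaving order below are illustrative assumptions, not the authors' released implementation:

```python
# Minimal sketch of the LayTextLLM idea from the abstract: project each
# OCR bounding box to ONE embedding and interleave it with the text-token
# embeddings before the LLM. Module shapes and the interleaving order are
# assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class BoxProjector(nn.Module):
    """Maps a normalized (x1, y1, x2, y2) box to a single LLM-sized embedding."""
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(4, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_boxes, 4) in [0, 1] -> (num_boxes, hidden_size)
        return self.proj(boxes)

def interleave(box_embeds, text_embeds, spans):
    """Place each box embedding directly before the token span it grounds.

    spans[i] = (start, end) indices into text_embeds for OCR line i.
    Returns a (num_boxes + num_tokens, hidden) sequence fed to the LLM.
    """
    pieces = []
    for i, (start, end) in enumerate(spans):
        pieces.append(box_embeds[i : i + 1])   # one token per bounding box
        pieces.append(text_embeds[start:end])  # the text it localizes
    return torch.cat(pieces, dim=0)

# Toy usage: two OCR lines, five text tokens, hidden size 8.
projector = BoxProjector(hidden_size=8)
boxes = torch.tensor([[0.1, 0.1, 0.5, 0.2], [0.1, 0.3, 0.9, 0.4]])
text = torch.randn(5, 8)
seq = interleave(projector(boxes), text, spans=[(0, 2), (2, 5)])
print(seq.shape)  # torch.Size([7, 8]) -> 2 box tokens + 5 text tokens
```

The one-token-per-box property is what keeps sequences short: layout adds only num_boxes extra tokens, rather than serializing every coordinate as text, while the interleaved order preserves the left-to-right autoregressive structure the LLM expects.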

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Hao Feng | Shu Wei | Xiang Fei | Wei Shi | Yingdong Han | Lei Liao | Jinghui Lu | Binghong Wu | Qi Liu | Chunhui Lin | Jingqun Tang | Hao Liu | Can Huang
Findings of the Association for Computational Linguistics: ACL 2025

Document image parsing is challenging due to its complexly intertwined elements, such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively; despite their decent performance, they face integration overhead, efficiency bottlenecks, and layout-structure degradation. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin.
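The analyze-then-parse paradigm described above can be sketched as a two-stage pipeline: one sequential analysis pass emits layout elements in reading order, then each element is parsed in parallel, conditioned on a task-specific prompt. The `model.generate` interface, prompt wording, and element schema below are assumptions for illustration, not Dolphin's actual API:

```python
# Minimal sketch of Dolphin's analyze-then-parse paradigm from the abstract.
# `model` is a stand-in for the multimodal parsing model; prompt strings,
# the element schema, and `page_image.crop` are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class LayoutElement:
    kind: str    # e.g. "paragraph", "table", "formula"
    bbox: tuple  # (x1, y1, x2, y2) region on the page

# Task-specific prompt per heterogeneous element type (assumed wording).
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse this table into markdown.",
    "formula": "Transcribe this formula as LaTeX.",
}

def analyze(model, page_image) -> list[LayoutElement]:
    """Stage 1: one autoregressive pass lists layout elements in reading order."""
    return model.generate(page_image, prompt="List all layout elements in reading order.")

def parse_element(model, page_image, element: LayoutElement) -> str:
    """Stage 2 worker: the element acts as an anchor, paired with its task prompt."""
    crop = page_image.crop(element.bbox)
    return model.generate(crop, prompt=PROMPTS.get(element.kind, "Parse this region."))

def parse_document(model, page_image) -> list[str]:
    elements = analyze(model, page_image)  # sequential analysis stage
    with ThreadPoolExecutor() as pool:     # parallel content-parsing stage
        return list(pool.map(lambda e: parse_element(model, page_image, e), elements))
```

Decoupling analysis from parsing is what enables the parallelism: once the reading-ordered anchors exist, each region's parse is independent of the others, which is the source of the efficiency gain the abstract claims over purely page-level autoregressive generation.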