A Bounding Box is Worth One Token - Interleaving Layout and Text in a Large Language Model for Document Understanding

Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang


Abstract
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout andText in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in KIE and VQA. Comprehensive benchmark evaluations reveal significant improvements of LayTextLLM, with a 15.2% increase on KIE tasks and 10.7% on VQA tasks compared to previous SOTA OCR-based LLMs. All resources are available at URL masked for anonymous review.
Anthology ID:
2025.findings-acl.379
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
7252–7273
Language:
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.379/
DOI:
Bibkey:
Cite (ACL):
Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, and Can Huang. 2025. A Bounding Box is Worth One Token - Interleaving Layout and Text in a Large Language Model for Document Understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7252–7273, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
A Bounding Box is Worth One Token - Interleaving Layout and Text in a Large Language Model for Document Understanding (Lu et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.379.pdf