Visually Guided Generative Text-Layout Pre-training for Document Intelligence

Zhiming Mao, Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong


Abstract
Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
Anthology ID:
2024.naacl-long.264
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
4713–4730
Language:
URL:
https://aclanthology.org/2024.naacl-long.264
DOI:
10.18653/v1/2024.naacl-long.264
Bibkey:
Cite (ACL):
Zhiming Mao, Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. Visually Guided Generative Text-Layout Pre-training for Document Intelligence. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4713–4730, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Visually Guided Generative Text-Layout Pre-training for Document Intelligence (Mao et al., NAACL 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.naacl-long.264.pdf