LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

Yang Xu; Yiheng Xu; Tengchao Lv; Lei Cui; Furu Wei; Guoxin Wang; Yijuan Lu; Dinei Florencio; Cha Zhang; Wanxiang Che; Min Zhang; Lidong Zhou

doi:10.18653/v1/2021.acl-long.201

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks, owing to its model architecture and the availability of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture the cross-modality interaction in the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 to 0.8420), CORD (0.9493 to 0.9601), SROIE (0.9524 to 0.9781), Kleister-NDA (0.8340 to 0.8520), RVL-CDIP (0.9443 to 0.9564), and DocVQA (0.7295 to 0.8672).
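The abstract does not spell out how the spatial-aware self-attention works, so the following is only a minimal sketch of one plausible reading: the raw attention score between two tokens is shifted by bias terms that depend on their relative position in the sequence and on the relative x/y offsets of their text boxes. The function name, the bias tables, the clipping limit, and the toy coordinates are all hypothetical, chosen for illustration rather than taken from the released code.

```python
import math


def clip(value, limit):
    """Clamp a relative offset to [-limit, limit] so it indexes a finite bias table."""
    return max(-limit, min(limit, value))


def spatial_aware_attention(scores, boxes, bias_1d, bias_x, bias_y, limit=2):
    """Add relative-position biases to raw attention scores, then softmax each row.

    scores : n x n list of raw query-key scores
    boxes  : list of (x, y) top-left box coordinates, one per token, in reading order
    bias_* : dicts mapping a clipped relative offset to a scalar bias
             (hypothetical stand-ins for learned parameters)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            d_seq = clip(j - i, limit)                    # relative 1D position
            d_x = clip(boxes[j][0] - boxes[i][0], limit)  # relative horizontal offset
            d_y = clip(boxes[j][1] - boxes[i][1], limit)  # relative vertical offset
            row.append(scores[i][j] + bias_1d[d_seq] + bias_x[d_x] + bias_y[d_y])
        # Numerically stable softmax over the biased scores of query i.
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy layout (assumed grid units): tokens 0 and 1 share a line, token 2 is one line below.
offsets = range(-2, 3)
zero = {d: 0.0 for d in offsets}
same_line = {d: (1.0 if d == 0 else 0.0) for d in offsets}  # favors tokens on the same line
boxes = [(0, 0), (1, 0), (0, 1)]
scores = [[0.0] * 3 for _ in range(3)]

plain = spatial_aware_attention(scores, boxes, zero, zero, zero)
biased = spatial_aware_attention(scores, boxes, zero, zero, same_line)
```

With all biases at zero the result is an ordinary softmax over the scores. With the `same_line` table, token 0 attends more to token 1 (same line) than to token 2 (next line), even though the raw scores are identical, which is the kind of layout sensitivity the abstract attributes to the mechanism.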

Anthology ID: 2021.acl-long.201
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Editors: Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 2579–2591
URL: https://aclanthology.org/2021.acl-long.201
DOI: 10.18653/v1/2021.acl-long.201
Cite (ACL): Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591, Online. Association for Computational Linguistics.
Cite (Informal): LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding (Xu et al., ACL-IJCNLP 2021)
PDF: https://preview.aclanthology.org/ingest-2024-clasp/2021.acl-long.201.pdf
Video: https://preview.aclanthology.org/ingest-2024-clasp/2021.acl-long.201.mp4
Code: microsoft/unilm + additional community code
Data: CORD, DocVQA, FUNSD, Kleister NDA, RFUND, RFUND-EN, RVL-CDIP, SQuAD, SROIE
