Li Wentao
2023
Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding
Haoli Bai | Zhiguang Liu | Xiaojun Meng | Li Wentao | Shuang Liu | Yifeng Luo | Nian Xie | Rongfu Zheng | Liangwei Wang | Lu Hou | Jiansheng Wei | Xin Jiang | Qun Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline usually contains words that are spatially and semantically correlated, which can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that Wukong-Reader brings superior performance on various VDU tasks in both English and Chinese. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.