LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder
Zhiyang Zhang, Yaping Zhang, Yupu Liang, Lu Xiang, Yang Zhao, Yu Zhou, Chengqing Zong
Abstract
Document image translation (DIT) aims to translate text embedded in images from one language to another. It is a challenging task that needs to understand visual layout with text semantics simultaneously. However, existing methods struggle to capture the crucial visual layout in real-world complex document images. In this work, we make the first attempt to incorporate layout knowledge into DIT in an end-to-end way. Specifically, we propose a novel Layout-aware end-to-end Document Image Translation (LayoutDIT) with multi-step conductive decoder. A layout-aware encoder is first introduced to model visual layout relations with raw OCR results. Then a novel multi-step conductive decoder is unified with hidden states conduction across three step-decoders to achieve the document translation step by step. Benefiting from the layout-aware end-to-end joint training, our LayoutDIT outperforms state-of-the-art methods with better parameter efficiency. Besides, we create a new multi-domain document image translation dataset to validate the model’s generalization. Extensive experiments show that LayoutDIT has a good generalization in diverse and complex layout scenes.- Anthology ID:
- 2023.findings-emnlp.673
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 10043–10053
- Language:
- URL:
- https://aclanthology.org/2023.findings-emnlp.673
- DOI:
- 10.18653/v1/2023.findings-emnlp.673
- Cite (ACL):
- Zhiyang Zhang, Yaping Zhang, Yupu Liang, Lu Xiang, Yang Zhao, Yu Zhou, and Chengqing Zong. 2023. LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10043–10053, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder (Zhang et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2023.findings-emnlp.673.pdf