A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration
Zhiyang Zhang, Yaping Zhang, Yupu Liang, Zhiyuan Chen, Lu Xiang, Yang Zhao, Yu Zhou, Chengqing Zong
Abstract
Document Image Translation (DIT), which aims at translating documents in images from source language to the target, plays an important role in Document Intelligence. It requires a comprehensive understanding of document multi-modalities and a focused concentration on relevant textual regions during translation. However, most existing methods usually rely on the vanilla encoder-decoder paradigm, severely losing concentration on key regions that are especially crucial for complex-layout document translation. To tackle this issue, in this paper, we propose a new Query-Response DIT framework (QRDIT). QRDIT reformulates the DIT task into a parallel response/translation process of the multiple queries (i.e., relevant source texts), explicitly centralizing its focus toward the most relevant textual regions to ensure translation accuracy. A novel dynamic aggregation mechanism is also designed to enhance the text semantics in query features toward translation. Extensive experiments in four translation directions on three benchmarks demonstrate its state-of-the-art performance, showing significant translation quality improvements toward whole-page complex-layout document images.
- Anthology ID:
- 2025.findings-acl.372
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venues:
- Findings | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7138–7149
- URL:
- https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.372/
- Cite (ACL):
- Zhiyang Zhang, Yaping Zhang, Yupu Liang, Zhiyuan Chen, Lu Xiang, Yang Zhao, Yu Zhou, and Chengqing Zong. 2025. A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7138–7149, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration (Zhang et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.372.pdf