A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration

Zhiyang Zhang; Yaping Zhang; Yupu Liang; Zhiyuan Chen; Lu Xiang; Yang Zhao; Yu Zhou; Chengqing Zong

Abstract

Document Image Translation (DIT), which aims to translate documents in images from a source language to a target language, plays an important role in Document Intelligence. It requires a comprehensive understanding of a document's multiple modalities and focused attention on the relevant textual regions during translation. However, most existing methods rely on the vanilla encoder-decoder paradigm and severely lose concentration on key regions, which are especially crucial for complex-layout document translation. To tackle this issue, we propose a new Query-Response DIT framework (QRDIT). QRDIT reformulates the DIT task as a parallel response/translation process over multiple queries (i.e., relevant source texts), explicitly concentrating on the most relevant textual regions to ensure translation accuracy. A novel dynamic aggregation mechanism is also designed to enhance the text semantics in query features for translation. Extensive experiments in four translation directions on three benchmarks demonstrate state-of-the-art performance, with significant translation quality improvements on whole-page complex-layout document images.

Anthology ID: 2025.findings-acl.372
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues: Findings | WS
Publisher: Association for Computational Linguistics
Pages: 7138–7149
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.372/
Cite (ACL): Zhiyang Zhang, Yaping Zhang, Yupu Liang, Zhiyuan Chen, Lu Xiang, Yang Zhao, Yu Zhou, and Chengqing Zong. 2025. A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7138–7149, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration (Zhang et al., Findings 2025)
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.372.pdf
