Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

Yupu Liang; Yaping Zhang; Zhiyang Zhang; Yang Zhao; Lu Xiang; Chengqing Zong; Yu Zhou

Abstract

Document Image Machine Translation (DIMT) aims to translate text within document images, but it faces generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix Modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios. The code will be released upon acceptance.

Anthology ID: 2025.acl-long.606
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 12391–12408
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.606/
Cite (ACL): Yupu Liang, Yaping Zhang, Zhiyang Zhang, Yang Zhao, Lu Xiang, Chengqing Zong, and Yu Zhou. 2025. Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12391–12408, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation (Liang et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.606.pdf
