@inproceedings{zuo-etal-2025-inimagetrans,
title = "{I}n{I}mage{T}rans: Multimodal {LLM}-based Text Image Machine Translation",
author = "Zuo, Fei and
Chen, Kehai and
Zhang, Yu and
Xue, Zhengshan and
Zhang, Min",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/landing_page/2025.findings-acl.1039/",
pages = "20256--20277",
ISBN = "979-8-89176-256-5",
abstract = "Multimodal large language models (MLLMs) have shown remarkable capabilities across various downstream tasks. However, when MLLMs are transferred to the text image machine translation (TiMT) task, preliminary experiments reveal that MLLMs suffer from serious repetition and omission hallucinations. To alleviate these issues, this paper first designs an efficient MLLM named InImageTrans for TiMT and then proposes a simple and effective method named multi-conditional direct preference optimization (mcDPO) for advancing the TiMT. Particularly, the proposed mcDPO not only guides the MLLM in rejecting repetition output by creating text output preference pairs automatically, but also guides the MLLM in paying more attention to text information in images by creating image input preference pairs. Furthermore, we build a high-quality benchmark called MCiT for comprehensively evaluating the TiMT capabilities of InImageTrans. Experimental results show that the proposed method significantly outperforms existing open-source MLLMs on MCiT."
}
Markdown (Informal)
[InImageTrans: Multimodal LLM-based Text Image Machine Translation](https://preview.aclanthology.org/landing_page/2025.findings-acl.1039/) (Zuo et al., Findings 2025)
ACL
Fei Zuo, Kehai Chen, Yu Zhang, Zhengshan Xue, and Min Zhang. 2025. InImageTrans: Multimodal LLM-based Text Image Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20256–20277, Vienna, Austria. Association for Computational Linguistics.