MT3: A Synergistic Multi-Task RL Framework for Specializing MLLMs in Text Image Machine Translation

Zhaopeng Feng; Yupu Liang; Shaosheng Cao; Jiayuan Su; Jiahan Ren; Zhijie Zhou; Wenxuan Huang; Jian Wu; Zuozhu Liu

Abstract

Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT³, a novel Multi-Task RL framework to specialize MLLMs into end-to-end expert TIMT models. MT³ adopts a synergistic multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that provides fine-grained feedback, fostering a controllable and transparent optimization process. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT³-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.

Anthology ID: 2026.acl-long.460
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 10140–10157
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.460/
Cite (ACL): Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhijie Zhou, Wenxuan Huang, Jian Wu, and Zuozhu Liu. 2026. MT3: A Synergistic Multi-Task RL Framework for Specializing MLLMs in Text Image Machine Translation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10140–10157, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MT3: A Synergistic Multi-Task RL Framework for Specializing MLLMs in Text Image Machine Translation (Feng et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.460.pdf
Checklist: 2026.acl-long.460.checklist.pdf
