MENTOR: Efficient Autoregressive Image Generation with Balanced Multimodal Control

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu


Abstract
Recent text-to-image models achieve impressive visual quality but still face challenges in precise controllability, balancing multimodal inputs, and high training cost for multimodal image generation. To address these limitations, we propose MENTOR, an autoregressive (AR) framework with a two-stage training paradigm for controllable multimodal image generation: (1) a multimodal alignment stage that establishes robust pixel-level and semantic-level alignment between inputs and generated tokens, followed by (2) a multimodal instruction tuning stage that balances the model’s integration of multimodal inputs and enhances generation controllability. Extensive experiments on DreamBench++ and DreamBench demonstrate that, despite its modest model size and training resources, MENTOR achieves a strong balance between textual and visual guidance for controllable image generation, delivering competitive performance at significantly lower computational cost than leading baselines. Moreover, our approach attains superior image reconstruction fidelity, broad adaptability across different tasks, and training efficiency.
Anthology ID:
2026.findings-acl.1508
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
30167–30193
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1508/
Cite (ACL):
Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, and Junjie Hu. 2026. MENTOR: Efficient Autoregressive Image Generation with Balanced Multimodal Control. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30167–30193, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MENTOR: Efficient Autoregressive Image Generation with Balanced Multimodal Control (Zhao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1508.pdf
Checklist:
 2026.findings-acl.1508.checklist.pdf