Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, Xiaojun Wan


Abstract
Evaluation is important for multimodal generation tasks, yet traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of multimodal large language models (MLLMs), there is growing interest in applying MLLMs to build general evaluation systems. However, existing studies often simply collect large-scale evaluation data for training while overlooking the quality of that data. Moreover, currently proposed evaluation models often struggle to achieve consistently strong performance across both image-to-text (I2T) and text-to-image (T2I) tasks. In this paper, we apply rigorous quality control strategies to construct a comprehensive multimodal evaluation dataset, Minos-57K, with evaluation samples drawn from 15 datasets, and use it to develop the multimodal evaluation model Minos with supervised fine-tuning (SFT) and preference alignment training strategies. Notably, despite using less than half the training data of prior work, our model achieves state-of-the-art evaluation performance among all open-source multimodal evaluation models across 16 out-of-domain datasets covering both I2T and T2I tasks, and remains competitive with closed-source models. Extensive experiments demonstrate the importance of the quality control process, of joint training on evaluation data from both I2T and T2I generation tasks, and of further preference alignment.
Anthology ID:
2026.findings-acl.744
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15108–15132
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.744/
Cite (ACL):
Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, and Xiaojun Wan. 2026. Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15108–15132, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text (Zhang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.744.pdf
Checklist:
 2026.findings-acl.744.checklist.pdf