Wenya Xie


2025

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Wentao Ge | Shunian Chen | Hardy Chen | Nuo Chen | Junying Chen | Zhihong Chen | Wenya Xie | Shuo Yan | Chenghao Zhu | Ziyue Lin | Dingjie Song | Xidong Wang | Anningzhe Gao | Zhang Zhiyi | Jianquan Li | Xiang Wan | Benyou Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are largely limited to objective queries and do not consider real-world user experience, inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to evaluation, as it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs, which evaluates MLLMs with per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 26 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark is evidenced by its 88.02% agreement with human evaluation. We contend that the proposed paradigm demonstrates the potential of MLLMs as effective evaluation tools with the help of per-sample criteria.
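
The abstract describes a pairwise, judge-based protocol in which each evaluation sample carries its own criteria. The following Python sketch is purely illustrative of that idea and is not the authors' implementation: the `Sample` fields, the `judge` callable, and the prompt wording are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    image_path: str   # multimodal input for the query
    question: str     # open-ended query posed to the evaluated models
    criteria: str     # per-sample evaluation criteria attached to this sample


def pairwise_judgment(
    judge: Callable[[str, str], str],  # hypothetical judge MLLM: (image_path, prompt) -> free-text verdict
    sample: Sample,
    answer_a: str,
    answer_b: str,
) -> str:
    """Ask the judge MLLM to compare two model answers against the
    sample's own criteria and return 'A', 'B', or 'Tie'."""
    prompt = (
        f"Question: {sample.question}\n"
        f"Evaluation criteria for this sample:\n{sample.criteria}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Judge which answer better satisfies the criteria above. "
        "Reply with exactly one of: A, B, Tie."
    )
    verdict = judge(sample.image_path, prompt).strip()
    # Fall back to a tie if the judge's reply cannot be parsed.
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"
```

Under this kind of setup, benchmarking many models reduces to running `pairwise_judgment` over all model pairs and samples and aggregating the verdicts; the judge's agreement with human annotators (88.02% in the paper) is what supports using it in place of manual evaluation.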