MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Wentao Ge | Shunian Chen | Hardy Chen | Nuo Chen | Junying Chen | Zhihong Chen | Wenya Xie | Shuo Yan | Chenghao Zhu | Ziyue Lin | Dingjie Song | Xidong Wang | Anningzhe Gao | Zhang Zhiyi | Jianquan Li | Xiang Wan | Benyou Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have broadened the scope of AI applications. However, existing automatic evaluation methodologies for MLLMs are largely limited to objective queries and do not reflect real-world user experience, inadequately addressing the nuances of creative and associative multimodal tasks. The open-ended and subjective nature of such tasks poses a significant challenge to evaluation, since ground-truth answers are difficult to define. To this end, we propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 26 popular MLLMs in a pairwise-comparison fashion and observe diverse performance across models. Moreover, the validity of our benchmark is reflected in its 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria.
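As a rough illustration of the paradigm the abstract describes, the sketch below shows how per-sample criteria could be folded into a pairwise judging prompt and aggregated into a win rate. The `judge` callable, the `Sample` fields, and the prompt wording are assumptions made for illustration only; they are not the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Minimal sketch of per-sample-criteria pairwise judging, assuming a
# hypothetical `judge(image_path, prompt) -> str` backed by a potent MLLM.

@dataclass
class Sample:
    image_path: str   # multimodal input
    question: str     # open-ended query
    criteria: str     # judging criteria curated for this specific sample

def pairwise_vote(judge: Callable[[str, str], str],
                  sample: Sample, answer_a: str, answer_b: str) -> str:
    """Ask the judge to compare two answers under the sample's own criteria."""
    prompt = (
        f"Question: {sample.question}\n"
        f"Judging criteria for this sample: {sample.criteria}\n\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer better satisfies the criteria? Reply with 'A', 'B', or 'tie'."
    )
    verdict = judge(sample.image_path, prompt).strip().lower()
    if "tie" in verdict:
        return "tie"
    return "A" if verdict.startswith("a") else "B"

def win_rate(judge: Callable[[str, str], str], samples: List[Sample],
             answers_a: List[str], answers_b: List[str]) -> float:
    """Fraction of samples where model A is preferred; ties count as half."""
    score = 0.0
    for s, a, b in zip(samples, answers_a, answers_b):
        v = pairwise_vote(judge, s, a, b)
        score += 1.0 if v == "A" else 0.5 if v == "tie" else 0.0
    return score / len(samples)
```

In practice, judge-based pairwise comparisons are usually run in both answer orders to mitigate position bias; that detail is omitted here for brevity.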