Guangtao Zhai
2025
Redundancy Principles for MLLMs Benchmarks
Zicheng Zhang | Xiangyu Zhao | Xinyu Fang | Chunyi Li | Xiaohong Liu | Xiongkuo Min | Haodong Duan | Kai Chen | Guangtao Zhai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. This rapid growth has inevitably led to significant redundancy among benchmarks. It is therefore crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) redundancy of benchmark capability dimensions, 2) redundancy in the number of test questions, and 3) cross-benchmark redundancy within specific domains. Through a comprehensive analysis of the performance of hundreds of MLLMs across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Xiangyu Zhao | Shengyuan Ding | Zicheng Zhang | Haian Huang | Maosongcao Maosongcao | Jiaqi Wang | Weiyun Wang | Xinyu Fang | Wenhai Wang | Guangtao Zhai | Hua Yang | Haodong Duan | Kai Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities.