MultiSkill: Evaluating Large Multimodal Models for Fine-grained Alignment Skills

Zhenran Xu, Senbao Shi, Baotian Hu, Longyue Wang, Min Zhang


Abstract
We propose MultiSkill, an evaluation protocol that assesses large multimodal models (LMMs) across multiple fine-grained skills for alignment with human values. Recent LMMs have shown various intriguing abilities, such as solving graph theory problems and explaining visual jokes. However, existing multimodal benchmarks have mainly focused on coarse-grained evaluation (e.g., accuracy), without considering the skill composition required by specific instructions. To this end, we present MultiSkill, designed to decompose coarse-level scoring into fine-grained, skill-set-level scoring tailored to each instruction. MultiSkill defines five core vision-language capabilities and divides them into 12 skills that are necessary to align with user instructions. For evaluation metrics on specific skills, we propose an LMM-based evaluator for open-ended outputs. Based on diverse instructions collected from 66 datasets spanning 10 domains, we compare multiple representative open-source and proprietary LMMs and find a high correlation between model-based and human-based evaluations. Our experiments underscore the importance of fine-grained evaluation in providing a holistic view of model performance and enhancing the reliability of the evaluation.
Anthology ID:
2024.findings-emnlp.81
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1506–1523
URL:
https://aclanthology.org/2024.findings-emnlp.81
DOI:
10.18653/v1/2024.findings-emnlp.81
Cite (ACL):
Zhenran Xu, Senbao Shi, Baotian Hu, Longyue Wang, and Min Zhang. 2024. MultiSkill: Evaluating Large Multimodal Models for Fine-grained Alignment Skills. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1506–1523, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
MultiSkill: Evaluating Large Multimodal Models for Fine-grained Alignment Skills (Xu et al., Findings 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.findings-emnlp.81.pdf