Renrui Zhang
2025
SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
Ziyu Guo | Renrui Zhang | Hao Chen | Jialin Gao | Dongzhi Jiang | Jiaze Wang | Pheng-Ann Heng
Findings of the Association for Computational Linguistics: ACL 2025
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of the knowledge required for solving it, i.e., Knowledge-free, Knowledge-lite, and Knowledge-rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and Vision-only, which move progressively more of the question information from the text into the diagrams. By comparing the results across versions, SciVerse systematically examines the professional knowledge and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy that conducts a step-wise assessment of knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights for future development. Project page: https://sciverse-cuhk.github.io
2024
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
Shitian Zhao | Renrui Zhang | Xu Luo | Yan Wang | Shanghang Zhang | Peng Gao
Findings of the Association for Computational Linguistics: EMNLP 2024
Model fusing has always been an important topic, especially in an era where large language models (LLMs) and multi-modal language models (MLMs) with different architectures, parameter sizes, and training pipelines are being created all the time. In this work, we propose a post-hoc framework for fusing heterogeneous models off-the-shelf, which we call likelihood composition. The basic idea is to compose the likelihood distributions of multiple models when performing a multi-choice visual-question-answering (VQA) task. Here the core concept, likelihood, is the log-probability of a candidate answer. In likelihood composition, we introduce four basic operations: debias, highlight, majority-vote, and ensemble. By combining (composing) these basic operations, we obtain mixed composition methods, which we call mix-composition. Through comprehensive experiments on 9 VQA datasets and 10 MLMs, we demonstrate the effectiveness of mix-composition compared with simple ensemble or majority-vote methods. Within this framework, people can propose new basic composition methods and combine them into new mixed composition methods. We hope our proposed likelihood composition can provide a new perspective on fusing heterogeneous models and inspire further exploration under this framework.
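As a rough illustration of two of the basic operations named in the abstract, the following minimal sketch applies ensemble and majority-vote to per-candidate log-probabilities from several models. The abstract does not define the operations precisely, so everything here is an assumption: the function names `ensemble` and `majority_vote`, the use of a plain average of log-probabilities for ensembling, and the toy scores are all hypothetical and not taken from the paper. The debias and highlight operations are omitted because the abstract gives no detail on them.

```python
import math
from collections import Counter


def ensemble(log_likelihoods):
    """Average each candidate's log-probability across models and
    return the index of the best candidate (assumed definition)."""
    n_models = len(log_likelihoods)
    n_candidates = len(log_likelihoods[0])
    averaged = [
        sum(model[i] for model in log_likelihoods) / n_models
        for i in range(n_candidates)
    ]
    return max(range(n_candidates), key=lambda i: averaged[i])


def majority_vote(log_likelihoods):
    """Let each model vote for its own top candidate and return the
    most frequent vote (ties go to the candidate voted for first)."""
    votes = [
        max(range(len(model)), key=lambda i: model[i])
        for model in log_likelihoods
    ]
    return Counter(votes).most_common(1)[0][0]


# Toy scores (hypothetical): three models, three candidate answers.
# Two models weakly prefer candidate 0; one strongly prefers candidate 1.
scores = [
    [math.log(0.34), math.log(0.33), math.log(0.33)],
    [math.log(0.34), math.log(0.33), math.log(0.33)],
    [math.log(0.01), math.log(0.98), math.log(0.01)],
]

ensemble_choice = ensemble(scores)
vote_choice = majority_vote(scores)
```

On these toy scores the two operations disagree: majority-vote follows the two weakly confident models, while the averaged log-probabilities favor the candidate that one model scores very highly. Differences of this kind are one reason composing several such operations could behave differently from using either alone.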
Co-authors
- Hao Chen (陈昊) 1
- Peng Gao 1
- Jialin Gao 1
- Ziyu Guo 1
- Pheng-Ann Heng 1