VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding

Zhaowei Liu; Xin Guo (郭鑫); Haotian Xia; Lingfeng Zeng; Fangqi Lou; Jinyi Niu; Mengping Li; Qi Qi; Jiahuan Li; Wei Zhang; Yinglong Wang; Weige Cai; Weining Shen; Liwen Zhang

Abstract

Multimodal large language models (MLLMs) hold great promise for automating complex financial analysis. To comprehensively evaluate their capabilities, we introduce VisFinEval, the first large-scale Chinese benchmark that spans the full front-middle-back office lifecycle of financial tasks. VisFinEval comprises 15,848 annotated question–answer pairs drawn from eight common financial image modalities (e.g., K-line charts, financial statements, official seals), organized into three hierarchical scenario depths: Financial Knowledge & Data Analysis, Financial Analysis & Decision Support, and Financial Risk Control & Asset Optimization. We evaluate 21 state-of-the-art MLLMs in a zero-shot setting. The top model, Qwen-VL-max, achieves an overall accuracy of 76.3%, outperforming non-expert humans but trailing financial experts by over 14 percentage points. Our error analysis uncovers six recurring failure modes—including cross-modal misalignment, hallucinations, and lapses in business-process reasoning—that highlight critical avenues for future research. VisFinEval aims to accelerate the development of robust, domain-tailored MLLMs capable of seamlessly integrating textual and visual financial information. The data and the code are available at https://github.com/SUFE-AIFLM-Lab/VisFinEval.

Anthology ID: 2025.emnlp-main.1229
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 24099–24157
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1229/
Cite (ACL): Zhaowei Liu, Xin Guo, Haotian Xia, Lingfeng Zeng, Fangqi Lou, Jinyi Niu, Mengping Li, Qi Qi, Jiahuan Li, Wei Zhang, Yinglong Wang, Weige Cai, Weining Shen, and Liwen Zhang. 2025. VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24099–24157, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding (Liu et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1229.pdf
Checklist: 2025.emnlp-main.1229.checklist.pdf
