FinMaster: A Holistic Benchmark for Full-Pipeline Financial Management with Large Language Models

Junzhe Jiang; Chang Yang; Aixin Cui; Sihan Jin; Yujing Zhang; Yilin Xiao; Ruiyu Wang; Bo Li; Xiao Huang; Danny Dongning Sun; Xinrun Wang

Abstract

Financial management is high-stakes: small errors can propagate into reporting deviations and costly downstream decisions. Yet real-world workflows remain labor-intensive and fragmented, and existing automation supports only isolated steps rather than complete workflows. Large language models (LLMs) show promise in automating financial workflows, but current benchmarks lack domain-specific data, realistic workflow-level task design, and standardized workflow-level evaluation. To address these gaps, we present **FinMaster**, a benchmark for evaluating LLMs on full financial management workflows spanning financial literacy, accounting, auditing, and consulting. **FinMaster** comprises three modules: *FinSim* generates synthetic datasets compliant with real-world accounting standards for diverse company types, enabling realistic evaluation without relying on proprietary financial records; *FinSuite* offers 183 tasks across core financial domains; and *FinEval* provides a unified evaluation framework. Extensive experiments on state-of-the-art models, including GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-V3, reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to 40% on complex scenarios requiring multi-step reasoning. This degradation reflects error propagation: accuracy reaches 58% for single-metric calculations but decreases to 37% in multi-metric settings. **FinMaster** provides scalable and reproducible benchmarking for realistic end-to-end financial workflows, helping advance reliable deployment of LLMs in financial practice.

Anthology ID: 2026.findings-acl.385
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7787–7844
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.385/
Cite (ACL): Junzhe Jiang, Chang Yang, Aixin Cui, Sihan Jin, Yujing Zhang, Yilin Xiao, Ruiyu Wang, Bo Li, Xiao Huang, Danny Dongning Sun, and Xinrun Wang. 2026. FinMaster: A Holistic Benchmark for Full-Pipeline Financial Management with Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7787–7844, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): FinMaster: A Holistic Benchmark for Full-Pipeline Financial Management with Large Language Models (Jiang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.385.pdf
Checklist: 2026.findings-acl.385.checklist.pdf