Aixin Cui

2026

Financial management is high-stakes: small errors can propagate into reporting deviations and costly downstream decisions. Yet real-world workflows remain labor-intensive and fragmented, and existing automation supports only isolated steps rather than complete workflows. Large language models (LLMs) show promise in automating financial workflows, but current benchmarks lack domain-specific data, realistic workflow-level task design, and standardized workflow-level evaluation. To address these gaps, we present **FinMaster**, a benchmark for evaluating LLMs on full financial management workflows spanning financial literacy, accounting, auditing, and consulting. **FinMaster** comprises three modules: *FinSim* generates synthetic datasets compliant with real-world accounting standards for diverse company types, enabling realistic evaluation without relying on proprietary financial records; *FinSuite* offers 183 tasks across core financial domains; and *FinEval* provides a unified evaluation framework. Extensive experiments on state-of-the-art models, including GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-V3, reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to 40% on complex scenarios requiring multi-step reasoning. This degradation reflects error propagation: accuracy reaches 58% for single-metric calculations but decreases to 37% in multi-metric settings. **FinMaster** provides scalable and reproducible benchmarking for realistic end-to-end financial workflows, helping advance the reliable deployment of LLMs in financial practice.

Co-authors

Ruiyu Wang

Venues

Findings
