Xinrun Wang


2026

Financial management is high-stakes, where small errors can propagate into reporting deviations and costly downstream decisions, yet real-world workflows remain labor-intensive and fragmented, and existing automation supports only isolated steps rather than complete workflows. Large language models (LLMs) show promise in automating financial workflows, but current benchmarks lack domain-specific data, realistic workflow-level task design, and standardized workflow-level evaluation. To address these gaps, we present **FinMaster**, a benchmark for evaluating large language models on full financial management workflows spanning financial literacy, accounting, auditing, and consulting. **FinMaster** comprises three modules: *FinSim* generates synthetic datasets compliant with real-world accounting standards for diverse company types, enabling realistic evaluation without relying on proprietary financial records. *FinSuite* offers 183 tasks across core financial domains. *FinEval* provides a unified evaluation framework. Extensive experiments on state-of-the-art models including GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-V3 reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to 40% on complex scenarios requiring multi-step reasoning. This degradation reflects error propagation, where accuracy reaches 58% for single-metric calculations but decreases to 37% in multi-metric settings. **FinMaster** provides scalable and reproducible benchmarking for realistic end-to-end financial workflows, helping advance reliable deployment of LLMs in financial practice.
Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability or when training is dominated by a narrow set of recurring problem patterns.To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design.SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model’s capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms other RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

2025

World models achieve remarkable success in predicting future states and planning in complex environments and Large Language Models (LLMs) serve as promising foundation to build general world models. However, their performances are usually constrained by the limited external knowledge to specific environments. Existing research attempts to enhance LLM-based world models through prompting or fine-tuning approaches, which are either requiring human knowledge or computationally extensive. Therefore, we introduce Retrieval-Augmented World Models (RAWM), a novel framework that leverages retrieval-augmented generation to efficiently integrate the external knowledge to LLM-based world models. Our main contributions are threefold: (i) We introduce a memory system and design an embedding model to retrieve relevant experiences as the in-context examples to improve the world model’s predictive accuracy. (ii) We develop a reinforcement learning (RL) training pipeline that fine-tunes a small MLP head on the pre-trained embedding model using Proximal Policy Optimization (PPO), further enhancing prediction performance. (iii) We conduct extensive experiments across three diverse environments, i.e., Game24, BlocksWorld, and BabyAI, demonstrating that RAWM consistently outperforms baseline models and exhibits strong generalizability. By leveraging the retrieval-augmented generation and the efficient RL training pipeline, RAWM dynamically utilizes relevant historical experiences and equips LLMs with environment-specific external knowledge without retraining, enabling more accurate and generalizable predictions.
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM’s parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model’s parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model’s parametric knowledge, which undermines the model’s internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model’s parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG.