Ma Shuguang


2026

Ensuring the accuracy of financial documents is critical for economic analysis, regulatory compliance, and corporate decision-making. Several studies have shown that Large Language Models (LLMs) perform well on many financial tasks, such as stock price movement prediction and financial analytics. However, a critical task remains unexplored: the ability of LLMs to identify errors in financial documents. In this paper, we introduce **FinED-Bench**, the first publicly available benchmark for Financial Error Detection across three levels of cognitive complexity. FinED-Bench covers nine real-world financial scenarios and includes over 900 documents reported in 2025 that existing language models have not seen. We detail the benchmark construction process and evaluate several advanced LLMs (e.g., GPT-4o, Qwen3-14B) on this task, which requires both financial domain knowledge and reasoning capabilities. Experimental results show that current LLMs still struggle with this task, especially in high-complexity cases. In addition, supervised fine-tuning can significantly improve the performance of weaker LLMs on this task. Our data and code are available at https://anonymous.4open.science/r/FinED-Bench-406F.
Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework’s extensibility to non-mathematical tasks, highlighting its broad applicability. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model’s robustness.
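The multi-round rewrite-verify loop described above can be illustrated with a minimal sketch. This is not the MaSTer implementation: the function name `stress_test`, the callables `rewrite`, `verify`, and `solve`, and the round limit are all hypothetical placeholders, and the abstract does not specify how rewriting or verification is performed.

```python
from typing import Callable, Optional


def stress_test(
    question: str,
    answer: str,
    rewrite: Callable[[str], str],       # hypothetical: proposes a rewritten variant
    verify: Callable[[str, str], bool],  # hypothetical: checks semantic consistency
    solve: Callable[[str], str],         # hypothetical: the target model's answer
    max_rounds: int = 5,                 # assumed round budget
) -> Optional[str]:
    """Return a verified variant the target model gets wrong, or None."""
    current = question
    for _ in range(max_rounds):
        candidate = rewrite(current)
        # Discard rewrites that no longer match the original problem.
        if not verify(question, candidate):
            continue
        # A verified variant that changes the model's answer is a failure case.
        if solve(candidate) != answer:
            return candidate
        # Otherwise keep rewriting from the latest verified variant.
        current = candidate
    return None
```

Because `solve` is the model under test, the variants found this way are specific to that model, which is the property the abstract relies on to limit data contamination.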
Reinforcement learning (RL) has emerged as a key approach for improving long chain-of-thought (CoT) reasoning in large language models (LLMs). However, existing methods such as GRPO often break down when task difficulty exceeds the model’s capacity, resulting in sparse rewards and inefficient training. While prior work attempts to address this issue using off-policy data, it frequently introduces distributional mismatch, leading to unstable policy updates. In this work, we identify a fundamental issue underlying these limitations, which we term *low training affinity*, and propose **Affinity**, the first quantitative metric for measuring the compatibility between external guidance and a model’s intrinsic policy. Based on this insight, we introduce **HINT**, an adaptive framework designed to enhance reasoning performance while explicitly preserving high Affinity. HINT consists of two key components. First, instead of providing partial answers, it introduces **Meta-Hints**, which serve as abstract cognitive scaffolding that guides the model to independently construct solutions. Second, we propose **Affinity-Aware Policy Optimization (AAPO)**, which dynamically adjusts the learning objective based on the Affinity signal to ensure stable training. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while achieving improved training stability and robust generalization to out-of-distribution tasks. Code is available at: https://github.com/ViviqwerAsd/HINT
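The sparse-reward failure mode mentioned above can be seen in a small sketch of the group-relative advantage commonly used in GRPO, where each sampled response's reward is normalized by the mean and standard deviation of its group. The function name and the `eps` value are illustrative assumptions. The sketch does not cover Affinity, Meta-Hints, or AAPO, which the abstract does not specify in enough detail to reproduce.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero for uniform groups.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A task beyond the model's capacity: every sampled response fails.
too_hard = group_relative_advantages([0, 0, 0, 0])

# A task within reach: one of four sampled responses succeeds.
solvable = group_relative_advantages([1, 0, 0, 0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is the inefficiency the abstract attributes to tasks that exceed the model's capacity.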
Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency–performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency–performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
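The two mechanisms described above, restricting the efficiency reward to the mode-selection token and shifting that token's generation probability at inference time, can be sketched as follows. This is an illustration under stated assumptions, not the ADaPT implementation: the function names, the uniform per-token correctness reward, the `efficiency_bonus` value, and the logit-bias form of the inference-time control are all hypothetical.

```python
import math


def token_rewards(num_tokens, mode_index, correct, used_fast_mode,
                  efficiency_bonus=0.5):
    """Per-token rewards with the efficiency term on the mode token only."""
    # Assumption: the correctness reward is shared by every token.
    base = 1.0 if correct else 0.0
    rewards = [base] * num_tokens
    # The efficiency term touches only the mode-selection token, so the
    # length of a correct slow trajectory is never penalized.
    if used_fast_mode:
        rewards[mode_index] += efficiency_bonus
    return rewards


def fast_mode_probability(logit_fast, logit_slow, bias=0.0):
    """Probability of the fast-mode token after adding `bias` to its logit."""
    # A two-way softmax reduces to a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(-(logit_fast + bias - logit_slow)))
```

In this sketch, sweeping `bias` from negative to positive values moves a single model from mostly slow to mostly fast reasoning, which corresponds to the continuous efficiency–performance control the abstract describes.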