Zeju Li


2026

Complex reasoning with Large Language Models (LLMs) demands a careful balance between accuracy and computational cost. Verification is crucial for reliability but faces a trade-off: robust process-based verifiers are computationally prohibitive, while fast verifiers lack precision. We introduce flexive, a unified generative verifier designed to navigate this trade-off by dynamically allocating compute between rapid fast thinking and deliberative slow thinking. A key innovation is our training strategy: we use Group Relative Policy Optimization (GRPO) specifically to enhance the reliability of the fast mode. This targeted training generalizes effectively, elevating the slow mode to state-of-the-art open-source performance. To deploy flexive, we propose the solve-detect-verify (SDV) pipeline. Moving beyond static Best-of-N ranking, SDV employs an iterative refinement process that uses likelihood-based probing to detect solution completion, which curtails overthinking, and leverages flexive’s feedback for targeted correction. Solve-detect-verify establishes a new open-source state-of-the-art on ProcessBench, outperforming GenPRM-32B while requiring ~2.3x fewer TFLOPS and 15x less training data. On AIME 2024, the full SDV pipeline achieves 83.3% accuracy, surpassing strong baselines while using significantly fewer tokens.
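
As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes each sampled reward against the mean and standard deviation of its own group. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; the abstract does not specify flexive's reward design or training configuration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four fast-mode verdicts sampled for one problem:
# 1.0 when the verdict matches the reference label, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Verdicts that beat the group average receive a positive advantage and the others a negative one, so no separate value network is needed to form the baseline.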

2025

Knowledge distillation for large language models often uses Chain-of-Thought (CoT) and answer pairs, but existing methods struggle with appropriate supervision signals. Uniform constraints (e.g., cross-entropy) on CoT can enforce literal, verbose reasoning and suppress expressive diversity, while solely semantic constraints on answers can reduce accuracy in classification tasks. This paper proposes ThinkAnswer Loss, an information-theoretic differential supervision framework that decouples CoT and answer supervision. ThinkAnswer Loss applies semantic similarity constraints to the CoT portion while maintaining strict literal matching for the answer. We theoretically demonstrate its connection to mutual information maximization and derive a tight upper bound on generalization error. Experimental validation on text quality assessment and mathematical reasoning tasks shows that our method maintains answer accuracy while effectively reducing CoT length and preserving semantic content, thereby accelerating inference.
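
The decoupling described above can be sketched as a two-term objective: a semantic term on the CoT and a literal term on the answer. Everything in this block is a hypothetical illustration. The cosine-based CoT term, the `cot_weight` parameter, and the assumption that CoT embeddings are already available are choices made for the example, not the paper's actual formulation of ThinkAnswer Loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_cross_entropy(token_probs, target_ids):
    """Mean negative log-likelihood of the reference answer tokens."""
    nll = -sum(math.log(probs[t]) for probs, t in zip(token_probs, target_ids))
    return nll / len(target_ids)


def decoupled_loss(student_cot_emb, teacher_cot_emb,
                   answer_probs, answer_ids, cot_weight=1.0):
    """Semantic constraint on the CoT, literal matching on the answer."""
    # CoT term: penalize semantic drift, not token-level differences.
    cot_term = 1.0 - cosine_similarity(student_cot_emb, teacher_cot_emb)
    # Answer term: standard cross-entropy against the exact answer tokens.
    answer_term = answer_cross_entropy(answer_probs, answer_ids)
    return cot_weight * cot_term + answer_term
```

Under this sketch, a student that paraphrases or shortens the teacher's reasoning is penalized only to the extent that the meaning shifts, while any deviation in the answer tokens is penalized directly.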

Large Language Models have advanced significantly in complex reasoning, often leveraging external reward models to improve the reliability of their multi-step processes. However, existing process verification methods struggle to reliably assess incomplete reasoning traces and are limited by the cost of high-quality human annotations or the inherent noise in automatically generated labels. Therefore, we present Dyve, a dynamic process verifier that enhances reasoning error detection in large language models by integrating fast and slow thinking, inspired by Kahneman’s Systems Theory. Dyve adaptively applies immediate token-level confirmation (System 1) for straightforward steps and comprehensive analysis (System 2) for complex ones. Unlike traditional verifiers that only evaluate final outputs, Dyve employs a step-wise consensus-filtered supervision strategy, leveraging Monte Carlo estimation, LLM-as-a-Judge, and specialized reasoning models to extract high-quality training signals from noisy rollouts. Experimental results on ProcessBench and the MATH dataset confirm that Dyve significantly outperforms existing process-based verifiers and boosts performance in Best-of-N settings while maintaining computational efficiency by strategically allocating verification resources.
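
One way to picture consensus filtering over noisy rollouts is shown below: a step label is kept only when a Monte Carlo estimate and a judge verdict agree. This is a minimal sketch under stated assumptions. The dictionary fields, the 0.5 threshold, and the two-signal agreement rule are hypothetical and do not reproduce Dyve's actual data pipeline, which also involves specialized reasoning models.

```python
def monte_carlo_step_score(rollout_outcomes):
    """Fraction of rollouts continued from a step that reach the correct answer."""
    return sum(rollout_outcomes) / len(rollout_outcomes)


def consensus_filter(steps, threshold=0.5):
    """Keep a step label only when the MC estimate and the judge agree."""
    kept = []
    for step in steps:
        mc_label = monte_carlo_step_score(step["rollouts"]) >= threshold
        if mc_label == step["judge_says_correct"]:
            kept.append((step["text"], mc_label))
    return kept


# Hypothetical data: each rollout outcome is 1 if the continuation reached
# the correct final answer and 0 otherwise.
steps = [
    {"text": "step 1", "rollouts": [1, 1, 1, 0], "judge_says_correct": True},
    {"text": "step 2", "rollouts": [0, 0, 1, 0], "judge_says_correct": True},
    {"text": "step 3", "rollouts": [0, 0, 0, 0], "judge_says_correct": False},
]
training_labels = consensus_filter(steps)
```

In this example the second step is discarded because the two signals disagree, which trades some training data for cleaner labels.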