Xiangyu Wen


2025

Large Language Models have advanced significantly in complex reasoning, often leveraging external reward model to improve the reliability of their multi-step processes. However, existing process verification methods struggle with reliably assessing incomplete reasoning traces and are limited by the cost of high-quality human annotations or the inherent noise in automatically generated labels. Therefore, we present Dyve, a dynamic process verifier that enhances reasoning error detection in large language models by integrating fast and slow thinking, inspired by Kahneman’s Systems Theory. Dyve adaptively applies immediate token-level confirmation (System 1) for straightforward steps and comprehensive analysis (System 2) for complex ones. Unlike traditional verifiers that only evaluate final outputs, Dyve employs a step-wise consensus-filtered supervision strategy, leveraging Monte Carlo estimation, LLM-as-a-Judge, and specialized reasoning models to extract high-quality training signals from noisy rollouts. Experimental results on ProcessBench and the MATH dataset confirm that Dyve significantly outperforms existing process-based verifiers and boosts performance in Best-of-N settings while maintaining computational efficiency by strategically allocating verification resources.
Task-oriented dialogue (TOD) systems are widely used across various domains, including customer service, appointment scheduling, and technical support. In real-world scenarios, such systems must adhere to given operational guidelines. However, existing solutions based on large language models often cannot achieve strict guideline compliance, even when fine-tuned with domain knowledge. To address this issue, we introduce a novel TOD system named GuidedTOD, which explicitly considers domain-specific guidelines by integrating a policy module. This module employs a Markov Chain, termed Chained Prior, to efficiently encode and dynamically update guideline knowledge. During inference, the Chained Prior re-ranks outputs from the domain-expert language model using beam search, ensuring guideline adherence. Experimental results show that GuidedTOD significantly improves guideline compliance, achieving approximately 20% better action prediction accuracy than state-of-the-art solutions. Code is available here: https://github.com/cure-lab/GuidedTOD.