Xuhong Wang

2026

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints
Zhenyun Yin | Shujie Wang | Xuhong Wang | Xingjun Ma | Yingchun Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models with search capabilities frequently exhibit miscalibrated confidence, producing incorrect answers with high certainty. We present Deliberative Searcher, a reasoning-primary framework that integrates search operations into chain-of-thought generation while maintaining explicit confidence calibration. Our method employs constrained reinforcement learning with adaptive Lagrangian multipliers to jointly optimize correctness and reliability. Experiments across five benchmarks demonstrate substantial improvements: our 7B model reduces average false-certain rates from 54% in baselines to 2%, while our 72B variant achieves competitive accuracy with closed-source models and reduces false-certain rates to 9%. The well-calibrated confidence scores also enable more efficient test-time compute: instead of standard majority voting, we use confidence-weighted aggregation and match the performance of 16-sample majority voting with only 4 samples, a 4× reduction in inference compute. These results establish calibrated confidence as a foundation for both trustworthy outputs and adaptive test-time compute, demonstrating the value of the proposed constrained RL framework in search-augmented language models.

pdf bib abs

Recent advances in Graphical User Interface (GUI) and embodied navigation have driven progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDP), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of unifying GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks using a single formulation. (ii) employs a unified reinforcement learning framework on the mix data to improve generalization. (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further demonstrate the efficacy of our unified training strategy, data mixing strategy, and reward design. Resources will be released to the community.

pdf bib abs

From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks
Qingyu Ren | Tianjun Pan | Xingzhou Chen | Xuhong Wang
Findings of the Association for Computational Linguistics: ACL 2026

Large language models have achieved remarkable progress in text generation but still struggle with generative writing tasks. In terms of evaluation, existing evaluation benchmarks include few requirement types and writing reward models are not evaluated. In terms of training, existing studies often enhance writing ability through reinforcement learning with verifiable rewards (RLVR). Howerver, existing reward model training remains coarse-grained. To address these issues, we introduce W²Bench, a comprehensive evaluation benchmark, and WRL, a fine-grained training framework. W²Bench covers five task categories and seven requirement types, enabling systematic evaluation of both writing and writing reward models by measuring the correlation between reward rankings and golden rankings. WRL constructs positive and negative samples by dropping instruction requirements to construct positive and negative examples, allowing more precise reward model training. Experiments show that our models achieve substantial improvements on various writing benchmarks and exhibit strong generalization. We will release our code and data to support future research.

pdf bib abs

Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
Wenxuan Xie | Xuhong Wang
Findings of the Association for Computational Linguistics: ACL 2026

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model’s embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code and data will be made public upon publication.

2025

pdf bib abs

Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
Qianxi He | Qingyu Ren | Shanzhe Lei | Xuhong Wang | Yingchun Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, most current methods rely on rule-based evaluations of answer correctness, overlooking the importance of confidence-aware reasoning, especially for small to medium-sized models. These models often receive rewards for speculative answers without generating coherent reasoning chains. To address this limitation, we propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our codes and model in https://github.com/qianxiHe147/C2RM.

Co-authors

Xin Tan 1

Venues

Fix author