Yuanheng Zhu
2026
CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic
Yaocheng Zhang | Haohuan Huang | Zijun Song | Zijie Zhao | Qichao Zhang | Yuanheng Zhu | Dongbin Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement-learning-based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
2025
RLAE: Reinforcement Learning-Assisted Ensemble for LLMs
Yuqian Fu | Yuanheng Zhu | Jiajun Chai | Guojun Yin | Wei Lin | Qichao Zhang | Dongbin Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose **R**einforcement **L**earning-**A**ssisted **E**nsemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces an RL agent that dynamically adjusts ensemble weights by considering both input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single-agent and multi-agent reinforcement learning algorithms (RLAE_PPO and RLAE_MAPPO), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to 3.3% accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower latency. The source code is available here.