Yuan Li
2026
Efficient KL Divergence Estimation via Truncated Top-K Integration for Large Language Models
Xinyuan Wang | Luozhijie Jin | Bo Wang | Yuan Li | Zhangyue Yin | Xipeng Qiu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Kullback-Leibler (KL) divergence regularization is essential for stabilizing reinforcement learning from human feedback (RLHF) in large language models (LLMs), yet its exact computation requires summing over the entire vocabulary at every token position, incurring prohibitive memory costs during training. Existing stochastic estimators circumvent this bottleneck by estimating KL divergence using only the sampled token from the trajectory, but suffer from high variance (k1) or systematic bias (k2). We propose TIKE (Top-k Importance-weighted KL Estimator), which exploits the Zipfian structure of language model distributions: by deterministically integrating over only the top-k tokens, TIKE captures most of the probability mass while substantially reducing memory cost. To ensure correctness in the off-policy settings characteristic of Group Relative Policy Optimization (GRPO), we incorporate importance sampling weights that correct for distribution shift between rollout and optimization policies. Experiments on models across diverse benchmarks demonstrate that TIKE consistently outperforms stochastic baselines while exhibiting substantially lower gradient variance. Our analysis reveals that TIKE closely tracks the exact Rao-Blackwellized estimator with near-zero variance, offering a practical path toward stable, memory-efficient KL regularization for reasoning-intensive LLM training.
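The contrast drawn in this abstract between the exact sum, single-sample estimators, and a top-k truncated sum can be illustrated with a small sketch. This is not the authors' implementation: the toy Zipfian logits, the perturbed reference distribution, and every function name below are assumptions made for illustration, and the importance-sampling correction mentioned in the abstract is left out entirely. Here k1 and k2 are taken to be the commonly used single-sample forms log(p/q) and 0.5 (log(p/q))^2 for a token drawn from p.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zipf_logits(vocab_size, exponent=1.5):
    """Hypothetical toy logits giving p_i proportional to rank^(-exponent)."""
    return [-exponent * math.log(rank) for rank in range(1, vocab_size + 1)]


def exact_kl(p, q):
    """Exact KL(p || q): a sum over every token in the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def topk_truncated_kl(p, q, k):
    """Sum the KL terms over only the k most probable tokens under p.

    Returns the partial sum and the probability mass those tokens cover.
    """
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    partial = sum(p[i] * math.log(p[i] / q[i]) for i in top)
    mass = sum(p[i] for i in top)
    return partial, mass


def k1_estimate(p, q, token):
    """Single-sample estimate log(p/q); unbiased when token is drawn from p."""
    return math.log(p[token] / q[token])


def k2_estimate(p, q, token):
    """Single-sample estimate 0.5 * log(p/q)^2; non-negative but biased."""
    return 0.5 * math.log(p[token] / q[token]) ** 2


# Toy policy and reference distributions (assumed, not taken from the paper).
policy_logits = zipf_logits(1000)
reference_logits = [x + 0.3 * math.sin(i) for i, x in enumerate(policy_logits)]
p = softmax(policy_logits)
q = softmax(reference_logits)

if __name__ == "__main__":
    print("exact KL:", exact_kl(p, q))
    for k in (10, 50, 200):
        partial, mass = topk_truncated_kl(p, q, k)
        print(f"k={k}: truncated sum={partial:.6f}, covered mass={mass:.3f}")
```

In this toy setting the truncated sum is deterministic for a given pair of distributions, whereas k1 and k2 depend on which token happens to be sampled. How the tail beyond the top-k tokens is treated is not stated in the abstract, so the sketch simply drops it.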
2025
R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning
Yuan Li | Qi Luo | Xiaonan Li | Bufan Li | Qinyuan Cheng | Bo Wang | Yining Zheng | Yuxin Wang | Zhangyue Yin | Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To overcome this constraint, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness as the outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification as the process reward, which encourages the model to retrieve documents that are relevant to the user question. Together, these rewards let the model learn how to iteratively reason and retrieve relevant documents to reach the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and transfers well to different retrievers.
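The two rewards named in this abstract can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `Step` record, the exact-match answer check, the caller-supplied `is_relevant` judge, and the additive combination with `process_weight` are hypothetical and are not taken from the paper, which does not describe these details in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One reasoning-and-retrieval step of a trajectory (hypothetical shape)."""
    query: str
    documents: List[str]


def outcome_reward(predicted: str, gold: str) -> float:
    """1.0 if the final answer matches the gold answer, else 0.0.

    Exact match after normalization is an assumed correctness check.
    """
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def process_reward(
    question: str,
    steps: List[Step],
    is_relevant: Callable[[str, str], bool],
) -> float:
    """Fraction of steps that retrieved at least one relevant document."""
    if not steps:
        return 0.0
    hits = sum(
        1 for step in steps
        if any(is_relevant(question, doc) for doc in step.documents)
    )
    return hits / len(steps)


def trajectory_reward(
    question: str,
    steps: List[Step],
    predicted: str,
    gold: str,
    is_relevant: Callable[[str, str], bool],
    process_weight: float = 0.5,  # assumed weighting, not from the paper
) -> float:
    """Combine outcome and process rewards additively (an assumption)."""
    return (
        outcome_reward(predicted, gold)
        + process_weight * process_reward(question, steps, is_relevant)
    )
```

In practice the relevance judge would likely be a model rather than a keyword rule; the sketch takes it as a callable so that any judge can be plugged in.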
VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction
Jie Yang | Jiajun Chen | Zhangyue Yin | Shuo Chen | Yuxin Wang | Yiran Guo | Yuan Li | Yining Zheng | Xuanjing Huang | Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2025
Intelligent vehicle cockpits present unique challenges for API agents, requiring coordination across tightly coupled subsystems whose complexity exceeds that of typical task environments. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, which leads to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we found that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on GitHub.