Lihong Li
2026
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Junbo Li | Peng Zhou | Rui Meng | Meet P. Vadera | Lihong Li | Yang Li
Findings of the Association for Computational Linguistics: EACL 2026
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
Do LLMs Catch Their Own Mistakes? A Comprehensive Benchmark for Reflective Tool Use LLMs
Zheyuan Liu | Liqiang Xiao | Yang Li | Hyokun Yun | Lihong Li | Chao Zhang | Meng Jiang
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) increasingly rely on external tools to complete complex tasks, yet their ability to recognize and correct their own tool-use mistakes remains underexplored. Existing benchmarks primarily evaluate planning and execution success, overlooking the self-reflective dimension of tool use. To address this gap, we present ReflecTool-Bench, the first benchmark designed to assess LLMs’ self-reflective reasoning in tool-augmented multi-turn dialogues. ReflecTool-Bench covers 10 domains with 88 distinct APIs and 968 annotated dialogues, systematically injecting diverse error types arising from both user and assistant behavior. The benchmark defines two complementary evaluation setups: the Critique Task, where models diagnose errors in third-party dialogues, and the Self-Reflection Task, where models must detect and repair their own prior tool-use mistakes. We introduce fine-grained metrics for error detection, error classification, correction accuracy, and explanation quality, enabling a holistic assessment of reflective reasoning. Evaluations across 12 state-of-the-art models, including both API-based closed-source models and open-source models, reveal that while models can reliably identify user-originated errors, they struggle with assistant-originated ones, and performance drops sharply when moving from critique to self-reflection.
Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards
Ming Li | Pei Chen | Zhenhao Zhang | Tao Yang | Xinyang Zhang | Han Li | Tianyu Cao | Ming Zeng | Zhuofeng Wu | Meng Jiang | Huasheng Li | Lihong Li | Bing Yin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building reliable and trustworthy multi-turn LLMs.
2025
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Zhepei Wei | Wenlin Yao | Yao Liu | Weizhi Zhang | Qin Lu | Liang Qiu | Changlong Yu | Puyang Xu | Chao Zhang | Bing Yin | Hyokun Yun | Lihong Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and LLaMA-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
2021
The First Workshop on Evaluations and Assessments of Neural Conversation Systems
Wei Wei | Bo Dai | Tuo Zhao | Lihong Li | Diyi Yang | Yun-Nung Chen | Y-Lan Boureau | Asli Celikyilmaz | Alborz Geramifard | Aman Ahuja | Haoming Jiang
The First Workshop on Evaluations and Assessments of Neural Conversation Systems
2018
Subgoal Discovery for Hierarchical Dialogue Policy Learning
Da Tang | Xiujun Li | Jianfeng Gao | Chong Wang | Lihong Li | Tony Jebara
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Developing agents to engage in complex goal-oriented dialogues is challenging partly because the main learning signals are very sparse in long conversations. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given successful example dialogues, we propose the Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a multi-level policy by hierarchical reinforcement learning. We demonstrate our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals. Moreover, we show that the learned subgoals are often human comprehensible.
Neural Approaches to Conversational AI
Jianfeng Gao | Michel Galley | Lihong Li
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
This tutorial surveys neural approaches to conversational AI that were developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) social bots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between neural approaches and traditional symbolic approaches, and discuss the progress we have made and challenges we are facing, using specific systems and models as case studies.
2017
End-to-End Task-Completion Neural Dialogue Systems
Xiujun Li | Yun-Nung Chen | Lihong Li | Jianfeng Gao | Asli Celikyilmaz
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
One of the major drawbacks of modularized task-completion dialogue systems is that each module is trained individually, which presents several challenges. For example, downstream modules are affected by earlier modules, and the performance of the entire system is not robust to the accumulated errors. This paper presents a novel end-to-end learning framework for task-completion dialogue systems to tackle such issues. Our neural dialogue system can directly interact with a structured database to assist users in accessing information and accomplishing certain tasks. The reinforcement learning based dialogue manager offers robust capabilities to handle noise caused by other components of the dialogue system. Our experiments in a movie-ticket booking domain show that our end-to-end system not only outperforms modularized dialogue system baselines in both objective and subjective evaluation, but is also robust to noise, as demonstrated by several systematic experiments with different error granularities and rates specific to the language understanding module.
Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning
Baolin Peng | Xiujun Li | Lihong Li | Jianfeng Gao | Asli Celikyilmaz | Sungjin Lee | Kam-Fai Wong
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks. For example, the agent needs to reserve a hotel and book a flight so that there is enough time to commute between arrival and hotel check-in. This paper addresses this challenge by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales. The dialogue manager consists of: (1) a top-level dialogue policy that selects among subtasks or options, (2) a low-level dialogue policy that selects primitive actions to complete the subtask given by the top-level policy, and (3) a global state tracker that helps ensure all cross-subtask constraints are satisfied. Experiments on a travel planning task with simulated and real users show that our approach leads to significant improvements over three baselines, two based on handcrafted rules and the other based on flat deep reinforcement learning.
Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access
Bhuwan Dhingra | Lihong Li | Xiujun Li | Jianfeng Gao | Yun-Nung Chen | Faisal Ahmed | Li Deng
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper proposes KB-InfoBot, a multi-turn dialogue agent which helps users search Knowledge Bases (KBs) without composing complicated queries. Such goal-oriented dialogue agents typically need to interact with an external database to access real-world knowledge. Previous systems achieved this by issuing a symbolic query to the KB to retrieve entries based on their attributes. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of neural dialogue agents. In this paper, we address this limitation by replacing symbolic queries with an induced “soft” posterior distribution over the KB that indicates which entities the user is interested in. Integrating the soft retrieval process with a reinforcement learner leads to a higher task success rate and reward, both in simulations and against real users. We also present a fully neural end-to-end agent, trained entirely from user feedback, and discuss its application towards personalized dialogue agents.
2016
Co-authors
- Jianfeng Gao 7
- Xiujun Li 4
- Asli Celikyilmaz 3
- Yun-Nung Chen 3
- Li Deng 3
- Jianshu Chen 2
- Ji He 2
- Xiaodong He 2
- Meng Jiang 2
- Mari Ostendorf 2
- Bing Yin 2
- Hyokun Yun 2
- Chao Zhang 2
- Faisal Ahmed 1
- Aman Ahuja 1
- Y-Lan Boureau 1
- Tianyu Cao 1
- Pei Chen 1
- Bo Dai 1
- Bhuwan Dhingra 1
- Michel Galley 1
- Alborz Geramifard 1
- Tony Jebara 1
- Haoming Jiang 1
- Sungjin Lee 1
- Junbo Li 1
- Yang Li 1
- Yang Li 1
- Ming Li 1
- Han Li 1
- Huasheng Li 1
- Yao Liu 1
- Zheyuan Liu 1
- Qin Lu 1
- Rui Meng 1
- Baolin Peng 1
- Liang Qiu 1
- Da Tang 1
- Meet P. Vadera 1
- Chong Wang 1
- Wei Wei 1
- Zhepei Wei 1
- Kam-Fai Wong 1
- Zhuofeng Wu 1
- Liqiang Xiao 1
- Puyang Xu 1
- Diyi Yang 1
- Tao Yang 1
- Wenlin Yao 1
- Changlong Yu 1
- Ming Zeng 1
- Weizhi Zhang 1
- Zhenhao Zhang 1
- Xinyang Zhang 1
- Tuo Zhao 1
- Peng Zhou 1