Huasheng Li
2026
Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards
Ming Li | Pei Chen | Zhenhao Zhang | Tao Yang | Xinyang Zhang | Han Li | Tianyu Cao | Ming Zeng | Zhuofeng Wu | Meng Jiang | Huasheng Li | Lihong Li | Bing Yin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by recent progress in Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing the premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building reliable and trustworthy multi-turn LLMs.
2025
LongLeader: A Comprehensive Leaderboard for Large Language Models in Long-context Scenarios
Pei Chen | Hongye Jin | Cheng-Che Lee | Rulin Shao | Jingfeng Yang | Mingyu Zhao | Zhaoyu Zhang | Qin Lu | Kaiwen Men | Ning Xie | Huasheng Li | Bing Yin | Han Li | Lingyun Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Language Models (LLMs), exemplified by Claude and Llama, have exhibited impressive proficiency in tackling a myriad of Natural Language Processing (NLP) tasks. Yet, in pursuit of the ambitious goal of attaining Artificial General Intelligence (AGI), there remains ample room for enhancing LLM capabilities. Chief among these is the pressing need to bolster long-context comprehension. Numerous real-world scenarios require LLMs to reason adeptly across extended contexts, such as multi-turn dialogues or agent workflows. Hence, recent advancements have been dedicated to stretching the upper bounds of long-context comprehension, with models like Claude 3 accommodating up to 200k tokens, employing various techniques to achieve this feat. Aligned with this progression, we propose LongLeader, a leaderboard that seeks to comprehensively assess the long-context comprehension abilities of diverse LLMs and context length extension strategies across meticulously selected benchmarks. Specifically, we aim to address the following questions: 1) Do LLMs genuinely deliver the long-context proficiency they purport? 2) Which benchmarks offer reliable metrics for evaluating long-context comprehension? 3) What technical strategies prove effective in extending the understanding of longer contexts? We streamline the evaluation process for LLMs on the benchmarks, offering open-source access to the benchmarks and maintaining a dedicated website for leaderboards. We will continuously curate new datasets and update models on the leaderboards.
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates
Hy Dang | Tianyi Liu | Zhuofeng Wu | Jingfeng Yang | Haoming Jiang | Tao Yang | Pei Chen | Zhengyang Wang | Helen Wang | Huasheng Li | Bing Yin | Meng Jiang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form CoT is insufficient and sometimes counterproductive for structured function-calling tasks. To address this, we introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls. Experimental results show that our method reduces tool-use errors, achieving 3–12% relative improvements over strong baselines across diverse model series and approaches. Moreover, our framework enhances the robustness, interpretability, and transparency of tool-using agents, advancing the development of more reliable AI assistants for real-world applications.