2025
UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations
Fengran Mo | Yifan Gao | Chuan Meng | Xin Liu | Zhuofeng Wu | Kelong Mao | Zhengyang Wang | Pei Chen | Zheng Li | Xian Li | Bing Yin | Meng Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of conversational search systems has revolutionized how information is accessed by enabling multi-turn interaction between users and the system. Existing conversational search systems are usually built with two separate models, one for retrieval and one for response generation. This separation prevents the system from leveraging the intrinsic knowledge of both models simultaneously and cannot guarantee that effective retrieval benefits generation. Existing studies on unified models do not fully address understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversations. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce inconsistency risks while mitigating data discrepancy. Evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform existing baselines.
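The abstract describes joint fine-tuning of dense retrieval and response generation under different objectives, but does not spell out the training recipe. A minimal sketch of how two such objectives could be combined on one backbone is shown below; the loss functions, the in-batch-negative setup, and the 0.5 weighting are illustrative assumptions, not UniConv's actual mechanism.

```python
# Minimal sketch of joint fine-tuning with a retrieval and a generation
# objective sharing one backbone. Names and the alpha weighting are
# illustrative assumptions, not UniConv's actual recipe.
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(query_emb, passage_emb, temperature=0.05):
    """In-batch-negative contrastive loss for dense retrieval.

    query_emb, passage_emb: [batch, dim] embeddings where row i of each
    tensor forms the positive (query, passage) pair.
    """
    scores = query_emb @ passage_emb.T / temperature           # [batch, batch]
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

def joint_loss(query_emb, passage_emb, gen_logits, gen_labels, alpha=0.5):
    """Weighted sum of the retrieval and the generation objective."""
    l_ret = contrastive_retrieval_loss(query_emb, passage_emb)
    l_gen = F.cross_entropy(                                    # token-level LM loss
        gen_logits.view(-1, gen_logits.size(-1)),
        gen_labels.view(-1),
        ignore_index=-100,
    )
    return alpha * l_ret + (1 - alpha) * l_gen

# Toy example with random tensors standing in for model outputs.
q = F.normalize(torch.randn(4, 128), dim=-1)
p = F.normalize(torch.randn(4, 128), dim=-1)
logits = torch.randn(4, 16, 32000)          # [batch, seq_len, vocab]
labels = torch.randint(0, 32000, (4, 16))
print(joint_loss(q, p, logits, labels).item())
```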
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates
Hy Dang | Tianyi Liu | Zhuofeng Wu | Jingfeng Yang | Haoming Jiang | Tao Yang | Pei Chen | Zhengyang Wang | Helen Wang | Huasheng Li | Bing Yin | Meng Jiang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form CoT is insufficient and sometimes counterproductive for structured function-calling tasks. To address this, we introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls. Experimental results show that our method reduces tool-use errors, achieving 3–12% relative improvements over strong baselines across diverse model series and approaches. Moreover, our framework enhances the robustness, interpretability, and transparency of tool-using agents, advancing the development of more reliable AI assistants for real-world applications.
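The abstract contrasts free-form CoT with guided, structured reasoning templates for function calling but does not reproduce the templates themselves. A rough sketch of what a slot-filling template wrapped around an LLM call could look like follows; the step names, the example tool schema, and the `call_llm` stub are all hypothetical.

```python
# Sketch of a guided, structured reasoning template for function calling.
# The step names, tool schema, and call_llm stub are illustrative
# assumptions; the paper's actual templates may differ.
import json

TEMPLATE = """You must fill every field before calling a tool.
1. user_goal: restate the user's goal in one sentence.
2. tool_choice: pick one tool from {tool_names} and justify it briefly.
3. parameter_map: map each required parameter to a value from the request.
4. function_call: output a JSON object {{"name": ..., "arguments": {{...}}}}.

Tools:
{tool_docs}

User request: {request}
"""

def call_llm(prompt: str) -> str:
    # Stub standing in for any chat-completion API; returns a canned answer.
    return json.dumps({"name": "get_weather",
                       "arguments": {"city": "Paris", "unit": "celsius"}})

def guided_function_call(request, tools):
    prompt = TEMPLATE.format(
        tool_names=[t["name"] for t in tools],
        tool_docs=json.dumps(tools, indent=2),
        request=request,
    )
    raw = call_llm(prompt)
    return json.loads(raw)   # downstream code would validate against the schema

tools = [{"name": "get_weather",
          "parameters": {"city": "string", "unit": "celsius|fahrenheit"}}]
print(guided_function_call("What's the weather in Paris?", tools))
```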
DrAgent: Empowering Large Language Models as Medical Agents for Multi-hop Medical Reasoning
Fenglin Liu | Zheng Li | Hongjian Zhou | Qingyu Yin | Jingfeng Yang | Xin Liu | Zhengyang Wang | Xianfeng Tang | Shiyang Li | Xiang He | Ruijie Wang | Bing Yin | Xiao Gu | Lei Clifton | David A. Clifton
Findings of the Association for Computational Linguistics: EMNLP 2025
Although large language models (LLMs) have been shown to outperform human experts on medical examinations, it remains challenging to adopt LLMs in real-world clinical decision-making, which typically involves multi-hop medical reasoning. Common practices include prompting commercial LLMs and fine-tuning LLMs on medical data. However, in the clinical domain, using commercial LLMs raises privacy concerns regarding sensitive patient data. Fine-tuning competitive medical LLMs for different tasks usually requires extensive data and computing resources, which are difficult to acquire, especially in medical institutions with limited infrastructure. We propose DrAgent, which builds LLMs into agents that deliver accurate medical decision-making and reasoning. In implementation, we take a lightweight LLM as the backbone and have it collaborate with diverse clinical tools. To make efficient use of data, DrAgent introduces recursive curriculum learning to optimize the LLM in an easy-to-hard progression. The results show that our approach achieves competitive performance on diverse datasets.
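The abstract mentions recursive curriculum learning that optimizes the backbone LLM in an easy-to-hard progression, without detailing the schedule. A bare-bones sketch of such a staged loop is shown below; the difficulty proxy (number of reasoning hops) and the `fine_tune` stub are assumptions, not DrAgent's actual procedure.

```python
# Sketch of an easy-to-hard curriculum loop. The difficulty proxy
# (reasoning hops) and the fine_tune stub are illustrative assumptions.
def fine_tune(model, batch):
    # Placeholder for one fine-tuning pass over a batch of examples.
    print(f"fine-tuning on {len(batch)} examples")
    return model

def curriculum_train(model, examples, n_stages=3):
    """Train in stages, progressively admitting harder examples."""
    ranked = sorted(examples, key=lambda ex: ex["hops"])   # easy -> hard
    for stage in range(1, n_stages + 1):
        cutoff = len(ranked) * stage // n_stages
        model = fine_tune(model, ranked[:cutoff])   # revisit easy + add harder
    return model

examples = [{"question": "q1", "hops": 1},
            {"question": "q2", "hops": 3},
            {"question": "q3", "hops": 2}]
curriculum_train(model=None, examples=examples)
```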
Can Language Models Follow Multiple Turns of Entangled Instructions?
Chi Han | Xin Liu | Haodong Wang | Shiyang Li | Jingfeng Yang | Haoming Jiang | Zhengyang Wang | Qingyu Yin | Liang Qiu | Changlong Yu | Yifan Gao | Zheng Li | Bing Yin | Jingbo Shang | Heng Ji
Findings of the Association for Computational Linguistics: EMNLP 2025
Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), processing multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as the privacy of secrets, personal preferences, and prioritization, demanding sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs’ capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct, a dataset of 1.1K high-quality multi-turn conversations built through a human-in-the-loop approach and covering nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss: models achieve strong BLEU scores on memorization tasks, yet their attention mechanisms fail to effectively integrate multiple related instructions. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.
ALERT: An LLM-powered Benchmark for Automatic Evaluation of Recommendation Explanations
Yichuan Li | Xinyang Zhang | Chenwei Zhang | Mao Li | Tianyi Liu | Pei Chen | Yifan Gao | Kyumin Lee | Kaize Ding | Zhengyang Wang | Zhihan Zhang | Jingbo Shang | Xian Li | Trishul Chilimbi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recommendation explanation systems have become increasingly vital with the widespread adoption of recommender systems. However, existing recommendation explanation evaluation benchmarks suffer from limited item diversity, impractical user profiling requirements, and unreliable and unscalable evaluation protocols. We present ALERT, a model-agnostic recommendation explanation evaluation benchmark. The benchmark comprises three main contributions: 1) a diverse dataset encompassing 15 Amazon e-commerce categories with 2,761 user-item interactions, incorporating implicit preferences through purchase histories; 2) two novel LLM-powered automatic evaluators that enable scalable and human-preference-aligned evaluation of explanations; and 3) a robust divide-and-aggregate approach that synthesizes multiple LLM judgments, achieving 70% concordance with expert human evaluation and substantially outperforming existing methods. ALERT facilitates comprehensive evaluation of recommendation explanations across diverse domains, advancing the development of more effective explanation systems.
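The abstract's divide-and-aggregate evaluator synthesizes multiple LLM judgments into one score, but the exact protocol is not given here. One plausible shape of such a procedure is sketched below; the aspect names, the stubbed judge call, and mean aggregation are assumptions rather than ALERT's specification.

```python
# Sketch of a divide-and-aggregate explanation evaluator: score each
# aspect with several LLM judges, then aggregate. Aspect names, the
# judge stub, and mean aggregation are assumptions, not ALERT's spec.
from statistics import mean
import random

ASPECTS = ["faithfulness", "persuasiveness", "personalization"]

def llm_judge_score(explanation: str, aspect: str, seed: int) -> float:
    # Placeholder for a call to an LLM judge returning a 1-5 rating.
    random.seed(hash((explanation, aspect, seed)) % 2**32)
    return float(random.randint(1, 5))

def evaluate_explanation(explanation: str, n_judges: int = 3) -> dict:
    """Divide: score each aspect independently with several judges.
    Aggregate: average per aspect, then across aspects."""
    per_aspect = {
        aspect: mean(llm_judge_score(explanation, aspect, j)
                     for j in range(n_judges))
        for aspect in ASPECTS
    }
    per_aspect["overall"] = mean(per_aspect.values())
    return per_aspect

print(evaluate_explanation("Recommended because you bought hiking boots."))
```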
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Yuchen Zhuang | Jingfeng Yang | Haoming Jiang | Xin Liu | Kewei Cheng | Sanket Lokegaonkar | Yifan Gao | Qing Ping | Tianyi Liu | Binxuan Huang | Zheng Li | Zhengyang Wang | Pei Chen | Ruijie Wang | Rongzhi Zhang | Nasser Zalmout | Priyanka Nigam | Bing Yin | Chao Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B tokens of agent-specific data covering 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function-calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe for data mixing ratios. By continually pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and the generalization of LLMs to new tasks or environments.
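The abstract notes a scaling-law study to choose data-mixing ratios for continual pre-training, without giving the chosen mix. A minimal sketch of sampling training batches from several sub-corpora according to a fixed mixing ratio is shown below; the corpus names and the 0.5/0.3/0.2 mix are made up for illustration.

```python
# Sketch of ratio-weighted sampling across pre-training sub-corpora.
# The corpus names and the 0.5/0.3/0.2 mix are illustrative assumptions.
import random

MIX = {"tool_docs": 0.5, "call_trajectories": 0.3, "general_text": 0.2}

CORPORA = {
    "tool_docs": ["doc about search_api", "doc about weather_api"],
    "call_trajectories": ["trace: plan -> call -> observe"],
    "general_text": ["plain web text"],
}

def sample_batch(batch_size: int, seed: int = 0):
    """Draw documents with probability proportional to the mixing ratio."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append((source, rng.choice(CORPORA[source])))
    return batch

for source, doc in sample_batch(5):
    print(source, "->", doc)
```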