2025
Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation
Dongsheng Zhu | Weixian Shi | Zhengliang Shi | Zhaochun Ren | Shuaiqiang Wang | Lingyong Yan | Dawei Yin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) demonstrate remarkable capabilities, their ability to autonomously execute complex real-world tasks remains limited. Accordingly, tool learning has emerged to enable LLMs to effectively leverage external tools to extend their capabilities. Current tool-learning paradigms like CoT/ReAct employ sequential tool invocation but suffer from constrained perception and inadequate task planning. Alternative approaches using search-based decision trees incur substantial computational overhead. To address these limitations, we propose DTA-Llama (Divide-Then-Aggregate Llama), a novel parallel tool invocation framework featuring: (1) A Directed Acyclic Graph (DAG) structure that is transformed from traditional tree-based tool search paths, enabling parallel execution and contributing high-quality training data; (2) A process-thread-inspired inference mechanism that iteratively decomposes tasks into parallel tool-using subtasks while aggregating results for subsequent decisions. Experimental results show that our approach substantially enhances task performance while reducing token consumption and inference time. Llama2-7B, using our method, is comparable to the official parallel function calling method of GPT-3.5. The relevant code, dataset, and model weights are available at https://corn0205.github.io/.
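As a rough illustration of the divide-then-aggregate idea described in this abstract, the sketch below runs one layer of mutually independent tool calls in parallel and collects their results before moving to the next layer. It is not the authors' implementation: the tools (`get_weather`, `get_population`), their return values, and the fixed `plan` are hypothetical stand-ins, and in the described method an LLM would decide each new layer after seeing the aggregated results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tools with made-up return values; real tools would call external APIs.
def get_weather(city):
    return f"sunny in {city}"

def get_population(city):
    return {"Paris": 2_100_000}.get(city, 0)

TOOLS = {"get_weather": get_weather, "get_population": get_population}

def run_layer(calls):
    """Divide: run one layer of independent tool calls in parallel.
    Aggregate: return the results in the same order as the calls."""
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        futures = [pool.submit(TOOLS[name], *args) for name, args in calls]
        return [f.result() for f in futures]

def run_plan(layers):
    """Execute the layers of a DAG in order, keeping the aggregated
    results of every layer as a history for later decisions."""
    history = []
    for calls in layers:
        history.append(run_layer(calls))
    return history

# Assumption: a fixed one-layer plan; an LLM would normally produce each layer.
plan = [[("get_weather", ("Paris",)), ("get_population", ("Paris",))]]
results = run_plan(plan)
```

Threads are used here because tool calls are typically I/O-bound; the two calls in the layer do not depend on each other, so they can safely run concurrently.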
Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
Zhengliang Shi | Yuhan Wang | Lingyong Yan | Pengjie Ren | Shuaiqiang Wang | Dawei Yin | Zhaochun Ren
Findings of the Association for Computational Linguistics: ACL 2025
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
2024
Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering
Zhengliang Shi | Shuo Zhang | Weiwei Sun | Shen Gao | Pengjie Ren | Zhumin Chen | Zhaochun Ren
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The Multi-Hop Question Answering (MHQA) task presents a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method to smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method. To further facilitate future research, we have collected a dataset that traces the reasoning process.
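The two alternating phases in this abstract can be pictured as a simple loop. The sketch below is only a toy illustration under strong assumptions, not the GenGround implementation: `decompose`, `generate`, `retrieve`, and `ground` are hypothetical stand-ins for LLM and retriever calls, backed here by small hard-coded dictionaries, and the "amend" step is a deliberately naive string heuristic.

```python
def decompose(question, trace):
    # Toy stand-in for the LLM formulating the next single-hop question.
    plan = ["Who directed Inception?", "Where was Christopher Nolan born?"]
    return plan[len(trace)] if len(trace) < len(plan) else None

def generate(sub_question):
    # Toy "parametric knowledge"; the second answer is deliberately wrong.
    drafts = {
        "Who directed Inception?": "Christopher Nolan",
        "Where was Christopher Nolan born?": "Paris",
    }
    return drafts[sub_question]

def retrieve(sub_question):
    # Toy retriever over a two-document corpus.
    corpus = {
        "Who directed Inception?": ["Inception is a film directed by Christopher Nolan."],
        "Where was Christopher Nolan born?": ["Christopher Nolan was born in London."],
    }
    return corpus[sub_question]

def ground(draft, docs):
    # Keep the draft if a document supports it; otherwise amend it with a
    # naive heuristic (last word of the first document).
    if any(draft in doc for doc in docs):
        return draft
    return docs[0].rstrip(".").split()[-1]

def generate_then_ground(question, max_hops=4):
    trace = []
    for _ in range(max_hops):
        sub_question = decompose(question, trace)
        if sub_question is None:
            break
        draft = generate(sub_question)                 # phase 1: generate
        answer = ground(draft, retrieve(sub_question)) # phase 2: ground
        trace.append((sub_question, answer))
    return trace

trace = generate_then_ground("Where was the director of Inception born?")
```

In this toy run the first draft is kept because the retrieved document supports it, while the second draft is replaced by the answer found in the document.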
MAIR: A Massive Benchmark for Evaluating Instructed Retrieval
Weiwei Sun | Zhengliang Shi | Wu Jiu Long | Lingyong Yan | Xinyu Ma | Yiding Liu | Min Cao | Dawei Yin | Zhaochun Ren
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent information retrieval (IR) models are pre-trained and instruction-tuned on massive datasets and tasks, enabling them to perform well on a wide range of tasks and potentially generalize to unseen tasks with instructions. However, existing IR benchmarks focus on a limited scope of tasks, making them insufficient for evaluating the latest IR models. In this paper, we propose MAIR (Massive Instructed Retrieval Benchmark), a heterogeneous IR benchmark that includes 126 distinct IR tasks across 6 domains, collected from existing datasets. We benchmark state-of-the-art instruction-tuned text embedding models and re-ranking models. Our experiments reveal that instruction-tuned models generally achieve superior performance compared to non-instruction-tuned models on MAIR. Additionally, our results suggest that current instruction-tuned text embedding models and re-ranking models still lack effectiveness in specific long-tail tasks. MAIR is publicly available at https://github.com/sunnweiwei/Mair.
360°REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System
Shen Gao | Hao Li | Zhengliang Shi | Chengrui Huang | Quan Tu | Shuo Shang | Zhiliang Tian | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2024
Learning to Use Tools via Cooperative and Interactive Agents
Zhengliang Shi | Shen Gao | Xiuyi Chen | Yue Feng | Lingyong Yan | Haibo Shi | Dawei Yin | Pengjie Ren | Suzan Verberne | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2024
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating execution results into the next action prediction. Despite their progress, these methods suffer from performance degradation when addressing practical tasks due to: (1) the pre-defined pipeline with restricted flexibility to calibrate incorrect actions, and (2) the struggle to adapt a general LLM-based agent to perform a variety of specialized actions. To mitigate these problems, we propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. ConAgents introduces two communication protocols to enable the flexible cooperation of agents. To effectively generalize the ConAgents into open-source models, we also propose specialized action distillation, enhancing their ability to perform specialized actions in our framework. Our extensive experiments on three datasets show that the LLMs, when equipped with the ConAgents, outperform baselines with substantial improvement (i.e., up to 14% higher success rate).
2023
RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue
Zhengliang Shi | Weiwei Sun | Shuo Zhang | Zhen Zhang | Pengjie Ren | Zhaochun Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating open-domain dialogue systems is challenging for reasons such as the one-to-many problem, i.e., there are many appropriate responses other than just the golden response. As of now, automatic evaluation methods need better consistency with humans, while reliable human evaluation can be time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework, which leverages the pre-created utterance as reference other than the gold response to relieve the one-to-many problem. Specifically, RADE explicitly compares the reference and the candidate response to predict their overall scores. Moreover, an auxiliary response generation task enhances prediction via a shared encoder. To support RADE, we extend three datasets with additional rated responses other than just a golden response by human annotation. Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method, where Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.
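For readers unfamiliar with how agreement with human evaluation is usually quantified, the sketch below computes the three correlations named in this abstract in plain Python. It is a generic illustration, not code from the paper: the `human` and `metric` score lists are invented for the example, and the rank-based measures assume no tied scores (Kendall is computed as tau-a).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs; assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented example scores: human ratings vs. an automatic metric.
human = [1, 2, 3, 4, 5]
metric = [0.1, 0.4, 0.35, 0.8, 0.9]

rho = spearman(human, metric)
tau = kendall(human, metric)
r = pearson(human, metric)
```

Here the metric swaps the order of only one pair of responses relative to the human ratings, so both rank correlations stay high.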
Towards a Unified Framework for Reference Retrieval and Related Work Generation
Zhengliang Shi | Shen Gao | Zhen Zhang | Xiuying Chen | Zhumin Chen | Pengjie Ren | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2023
The task of related work generation aims to generate a comprehensive survey of related research topics automatically, saving time and effort for authors. Existing methods simplify this task by using human-annotated references in a large-scale scientific corpus as information sources, which is time- and cost-intensive. To this end, we propose a Unified Reference Retrieval and Related Work Generation Model (UR3WG), which combines reference retrieval and related work generation processes in a unified framework based on the large language model (LLM). Specifically, UR3WG first leverages the world knowledge of the LLM to extend the abstract and generate the query for the subsequent retrieval stage. Then a lexicon-enhanced dense retrieval is proposed to search relevant references, where an importance-aware representation of the lexicon is introduced. We also propose multi-granularity contrastive learning to optimize our retriever. Since this task is not simply summarizing the main points in references, it should analyze the complex relationships and present them logically. We propose an instruction-tuning method to leverage the LLM to generate related work. Extensive experiments on two widely used datasets demonstrate that our model outperforms the state-of-the-art baselines in both generation and retrieval metrics.