Jingming Zhuo


2024

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
Zehui Chen | Weihua Du | Wenwei Zhang | Kuikun Liu | Jiangning Liu | Miao Zheng | Jingming Zhuo | Songyang Zhang | Dahua Lin | Kai Chen | Feng Zhao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose the tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we further introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles the tool utilization evaluation into several sub-domains along model capabilities, facilitating the inner understanding of both holistic and isolated competency of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, providing a new perspective in LLM evaluation on tool-utilization ability. The benchmark will be available.

2023

Out-of-Distribution Generalization in Natural Language Processing: Past, Present, and Future
Linyi Yang | Yaoxian Song | Xuan Ren | Chenyang Lyu | Yidong Wang | Jingming Zhuo | Lingqiao Liu | Jindong Wang | Jennifer Foster | Yue Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Machine learning (ML) systems in natural language processing (NLP) face significant challenges in generalizing to out-of-distribution (OOD) data, where the test distribution differs from the training data distribution. This poses important questions about the robustness of NLP models and their high accuracy, which may be artificially inflated due to their underlying sensitivity to systematic biases. Despite these challenges, there is a lack of comprehensive surveys on the generalization challenge from an OOD perspective in natural language understanding. Therefore, this paper aims to fill this gap by presenting the first comprehensive review of recent progress, methods, and evaluations on this topic. We further discuss the challenges involved and potential future research directions. By providing convenient access to existing work, we hope this survey will encourage future research in this area.