Wu Liu
2025
ACEBench: A Comprehensive Evaluation of LLM Tool Usage
Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Yuefeng Huang, Xiangcheng Liu, Wang Xinzhi, Wu Liu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessment of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in depth and providing a more granular examination of error causes across the different data types.
2006
France Telecom R&D Beijing Word Segmenter for Sighan Bakeoff 2006
Wu Liu, Heng Li, Yuan Dong, Nan He, Haitao Luo, Haila Wang
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing
2005
Chinese Word Segmentation in FTRD Beijing
Heng Li, Yuan Dong, Xinnian Mao, Haila Wang, Wu Liu
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing