PlanningArena: A Modular Benchmark for Multidimensional Evaluation of Planning and Tool Learning
Zihan Zheng | Tianle Cui | Chuwen Xie | Jiahui Pan | Qianglong Chen | Lewei He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A key research focus for large language models (LLMs) is their ability to generate action plans. Recent studies have shown that LLM performance can be significantly improved by integrating external tools. Building on this, we propose PlanningArena, a benchmark framework that simulates real application scenarios and provides a suite of apps and API tools that may be involved in the actual planning process. The framework adopts a modular task structure and incorporates user portrait analysis to evaluate LLMs' ability to select the correct tools, reason logically in complex scenarios, and parse user information. In addition, we diagnose LLMs' task execution at both the macro and micro levels. Experimental results show that even the best-performing models, GPT-4o and DeepSeekV3, achieve total scores of only 56.5% and 41.9% on PlanningArena, respectively, indicating that current LLMs still struggle with logical reasoning, context memory, and tool calling across tasks of varying structure, scenario, and complexity. Through this benchmark, we further explore paths to optimizing LLMs for planning tasks.
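To make the abstract's description concrete, below is a minimal sketch of what a modular, tool-selection-style task instance and a macro-level scoring check could look like. All names and fields here are hypothetical illustrations; the paper's actual task schema and scoring rules are not specified in this abstract.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """A single app/API tool invocation proposed by a model."""
    name: str
    arguments: dict


@dataclass
class PlanningTask:
    """One modular task instance: a user portrait, the tools available
    in the simulated scenario, and a reference plan for scoring.
    (Hypothetical structure, for illustration only.)"""
    user_profile: dict               # e.g. {"age": 30, "city": "Guangzhou"}
    available_tools: list[str]       # tools the model may choose from
    reference_calls: list[ToolCall]  # gold-standard tool sequence


def tool_selection_accuracy(predicted: list[ToolCall],
                            task: PlanningTask) -> float:
    """Macro-level check: fraction of reference tools the model selected
    (order-insensitive). A micro-level check would additionally compare
    call arguments and ordering against the reference plan."""
    gold = {call.name for call in task.reference_calls}
    chosen = {call.name for call in predicted}
    return len(gold & chosen) / len(gold) if gold else 1.0
```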