ToolCPT: Improving Tool Utilization in LLM Agents via Continuous Pre-training

Yifan Yang, Jinghui Lu, Evadeng, Ao Yang, Peijie Yu, TingHao YU, Feng Zhang


Abstract
Autonomous agents powered by large language models (LLM-based agents) can use off-the-shelf tools to interact with their environment, solve real-world problems, and boost work efficiency. However, current approaches to enhancing tool use in LLM-based agents focus primarily on post-training fine-tuning or test-time context extension. These methods overlook the acquisition of fundamental tool knowledge during the early training phase, when models actually learn and internalize core knowledge representations, and this restricts model performance on out-of-distribution tool usage. To address this problem, we introduce ToolCPT, which enhances tool knowledge for LLM-based agents during continuous pre-training. We identify and bridge a key gap in current LLM training by shifting the focus from tool-calling patterns to deep internalization of core tool-knowledge representations. We begin by curating 5.1 million code artifacts from large-scale, high-quality code repositories. These artifacts are selected according to a set of criteria that define a usable "proxy agent tool", and together they form a comprehensive agent tool library. For each proxy tool, we then create a detailed playbook covering implementation specifications, core functionalities, interaction protocols with other tools, and illustrative positive and negative examples. This process yields a large-scale tool knowledge corpus of 18 billion tokens, which is used to continuously pre-train our model. Experiments show that our playbook-enhanced corpus catalyzes deep knowledge internalization, driving notable performance gains on multiple standard benchmarks.
Anthology ID:
2026.findings-acl.776
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15830–15856
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.776/
Cite (ACL):
Yifan Yang, Jinghui Lu, Evadeng, Ao Yang, Peijie Yu, TingHao YU, and Feng Zhang. 2026. ToolCPT: Improving Tool Utilization in LLM Agents via Continuous Pre-training. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15830–15856, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
ToolCPT: Improving Tool Utilization in LLM Agents via Continuous Pre-training (Yang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.776.pdf
Checklist:
 2026.findings-acl.776.checklist.pdf