ToolCPT: Improving Tool Utilization in LLM Agents via Continuous Pre-training
Yifan Yang, Jinghui Lu, Evadeng, Ao Yang, Peijie Yu, TingHao YU, Feng Zhang
Abstract
Autonomous agents powered by large language models (LLM-based agents) can use off-the-shelf tools to interact with the environment, solve real-world problems, and boost work efficiency. However, current approaches to enhancing tool use in LLM-based agents focus primarily on post-training fine-tuning or test-time context extension. These methods overlook the acquisition of fundamental tool knowledge during the early training phase, where models actually learn and internalize core knowledge representations, which restricts model performance on out-of-distribution tool usage. To address this problem, we introduce ToolCPT, which enhances tool knowledge for LLM-based agents during continuous pre-training. We identify and bridge a key gap in current LLM training by shifting the focus from tool-calling patterns to deep internalization of core tool-knowledge representations. We begin by curating 5.1 million code artifacts from large-scale, high-quality code repositories. These artifacts are selected according to a set of criteria that define a usable "proxy agent tool", thereby forming a comprehensive agent tool library. For each proxy tool, we then create a detailed playbook covering implementation specifications, core functionalities, interaction protocols with other tools, and illustrative positive and negative examples. This process yields a large-scale tool knowledge corpus of 18 billion tokens, which we use to continuously pre-train our model. Experiments show that our playbook-enhanced corpus catalyzes deep knowledge internalization, driving the model to notable performance gains on multiple standard benchmarks.
- Anthology ID:
- 2026.findings-acl.776
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15830–15856
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.776/
- Cite (ACL):
- Yifan Yang, Jinghui Lu, Evadeng, Ao Yang, Peijie Yu, TingHao YU, and Feng Zhang. 2026. ToolCPT: Improving Tool Utilization in LLM Agents via Continuous Pre-training. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15830–15856, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- ToolCPT: Improving Tool Utilization in LLM Agents via Continuous Pre-training (Yang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.776.pdf