ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations
Yuejin Xie, Youliang Yuan, Wenxuan Wang, Fan Mo, Jianmin Guo, Pinjia He
Abstract
LLMs are evolving into assistants that leverage tools, significantly expanding their capabilities but also introducing critical safety risks. Current models exhibit notable vulnerabilities, particularly in maintaining safety during multi-step tool interactions and in scenarios involving indirect harm. This paper introduces ToolSafety, a safety fine-tuning dataset designed to address these limitations. ToolSafety comprises 5,668 direct harm samples, 4,311 indirect harm samples, and 4,311 multi-step samples. Key features include support for multi-step safety through synthesized trajectories and realistic, context-aware sample generation. We fine-tuned LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct using ToolSafety. Experimental results demonstrate that these models effectively maintain safety in multi-step and indirect harm scenarios. Further analysis of superficial alignment across different decoding strategies, languages, and jailbreak prompts indicates that while some risks persist, the issue is less severe than in multi-step settings. Overall, our approach significantly improves safety across various scenarios with minimal impact on helpfulness, positioning ToolSafety as a valuable resource for building safer tool-using AI systems.
- Anthology ID:
- 2025.emnlp-main.714
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 14146–14167
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.714/
- Cite (ACL):
- Yuejin Xie, Youliang Yuan, Wenxuan Wang, Fan Mo, Jianmin Guo, and Pinjia He. 2025. ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14146–14167, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations (Xie et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.714.pdf