Jianmin Guo


2025

Superpose Task-specific Features for Model Merging
Haiquan Qiu | You Wu | Dong Li | Jianmin Guo | Quanming Yao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Model merging enables powerful capabilities in neural networks without requiring additional training. In this paper, we introduce a novel perspective on model merging by leveraging the fundamental mechanisms of neural network representation. Our approach is motivated by the linear representation hypothesis, which states that neural networks encode information through linear combinations of feature vectors. We propose a method that superposes task-specific features from individual models into a merged model. Our approach specifically targets linear transformation matrices, which are crucial for feature activation and extraction in deep networks. By formulating the merging process as a linear system, we preserve the output feature directions of the individual models and obtain merged models that maintain multi-task capabilities more effectively than existing methods. Extensive experiments across diverse benchmarks and models demonstrate that our method outperforms existing techniques.
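
To make the "merging as a linear system" idea concrete, here is a minimal sketch in NumPy. It is not the authors' released code: it assumes a simple least-squares formulation in which a merged weight matrix is chosen so that, on task-representative inputs (the arrays X_t below are a hypothetical stand-in for the paper's task-specific feature directions), it reproduces each task model's output features.

```python
# Minimal sketch: merge per-task linear layers by solving a least-squares system
# so the merged matrix reproduces each task model's outputs on representative inputs.
import numpy as np

def merge_linear_layers(task_weights, task_inputs):
    """Merge per-task weight matrices W_t (d_out x d_in) into one matrix W.

    task_weights: list of (d_out, d_in) arrays, one per task model.
    task_inputs:  list of (n_t, d_in) arrays of representative inputs per task
                  (assumed stand-in for task-specific feature directions).
    Solves min_W sum_t ||X_t W^T - X_t W_t^T||_F^2 via the normal equations.
    """
    d_in, d_out = task_weights[0].shape[1], task_weights[0].shape[0]
    gram = np.zeros((d_in, d_in))   # sum_t X_t^T X_t
    rhs = np.zeros((d_in, d_out))   # sum_t X_t^T X_t W_t^T
    for W_t, X_t in zip(task_weights, task_inputs):
        gram += X_t.T @ X_t
        rhs += X_t.T @ (X_t @ W_t.T)
    W_T = np.linalg.lstsq(gram, rhs, rcond=None)[0]
    return W_T.T

# Toy usage: two "task models" with random weights and random representative inputs.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 16)) for _ in range(2)]
Xs = [rng.normal(size=(32, 16)) for _ in range(2)]
W_merged = merge_linear_layers(Ws, Xs)
print(W_merged.shape)  # (8, 16)
```

The closed-form solve here is only one way to instantiate such a linear system; the paper's actual construction of the per-task feature directions may differ.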

ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations
Yuejin Xie | Youliang Yuan | Wenxuan Wang | Fan Mo | Jianmin Guo | Pinjia He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

LLMs are evolving into assistants that leverage tools, significantly expanding their capabilities but also introducing critical safety risks. Current models exhibit notable vulnerabilities, particularly in maintaining safety during multi-step tool interactions and in scenarios involving indirect harm. This paper introduces ToolSafety, a safety fine-tuning dataset designed to address these limitations. ToolSafety comprises 5,668 direct harm samples, 4,311 indirect harm samples, and 4,311 multi-step samples. Key features include support for multi-step safety through synthesized trajectories and realistic, context-aware sample generation. We fine-tuned LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct using ToolSafety. Experimental results demonstrate that these models effectively maintain safety in multi-step and indirect harm scenarios. Further analysis of superficial alignment across different decoding strategies, languages, and jailbreak prompts indicates that while some risks persist, the issue is less severe than in multi-step settings. Overall, our approach significantly improves safety across various scenarios with minimal impact on helpfulness, positioning ToolSafety as a valuable resource for building safer tool-using AI systems.
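
For readers unfamiliar with what a "multi-step" safety sample might look like, the sketch below gives a purely hypothetical illustration, not the released ToolSafety schema: a synthesized trajectory in which the assistant performs a benign tool call and then refuses once a later step in the same conversation turns harmful. All field names and labels here are assumptions for illustration only.

```python
# Hypothetical illustration (not the actual ToolSafety format) of a multi-step
# safety sample: benign tool use early in the trajectory, refusal at the harmful step.
multi_step_sample = {
    "tools": [
        {"name": "search_web", "description": "Search the web for a query."},
        {"name": "send_email", "description": "Send an email to a recipient."},
    ],
    "messages": [
        {"role": "user", "content": "Find the office address of ACME Corp."},
        {"role": "assistant", "tool_call": {"name": "search_web",
                                            "arguments": {"query": "ACME Corp office address"}}},
        {"role": "tool", "name": "search_web", "content": "ACME Corp, 123 Example St."},
        {"role": "user", "content": "Now email them a threatening message from an anonymous account."},
        # Target behavior: refuse the harmful final step even though earlier steps were benign.
        {"role": "assistant", "content": "I can't help with sending threatening messages."},
    ],
    "label": "multi_step",  # assumed label set: direct_harm, indirect_harm, multi_step
}
print(multi_step_sample["label"])
```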