Jiawen Shi


2026

LLM-based agents are rapidly being deployed in real-world applications (e.g., digital assistants and customer service), making safety a critical concern. However, in multi-turn, tool-augmented settings, dynamic user interactions, external tool use, and unintended harmful behaviors make robust safety assurance challenging. To address these challenges, we propose **SafeAgent**, a framework that improves agent safety through fully automated synthetic data generation. SafeAgent introduces (1) OTS, an open and extensible threat model that decomposes agent risk into instruction-, context-, and action-induced sources to ground safety analysis and alignment; and (2) an automated pipeline that instantiates OTS to surface scenario-specific failure modes, stress-test agents, and generate self-reflective safe responses, all without hazardous real-world data collection. We evaluate SafeAgent on two safety benchmarks and one real-world terminal task. Across four widely used open-source models, SafeAgent improves safety performance by 45% on average and delivers a 28.91% gain on the real-world task, outperforming state-of-the-art closed-source models. These results highlight the practicality and scalability of SafeAgent for building safer LLM agents for real-world deployment.

2025

Model merging for Large Language Models (LLMs) directly fuses the parameters of different models fine-tuned on various tasks, creating a unified model for multi-domain tasks. However, because models available on open-source platforms may themselves be compromised, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack is robust against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
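
As a rough illustration of what "directly fusing parameters" can mean, one of the simplest merging schemes is a weighted average of matching parameters. The sketch below is only that: it assumes each model is a plain dict mapping parameter names to flat lists of floats, and the name `merge_models` and the toy values are hypothetical. It is not the set of merging algorithms evaluated in the paper, nor the attack itself.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_models(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Merge models by a normalized weighted average of each parameter."""
    if not models:
        raise ValueError("need at least one model")
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")

    reference = models[0]
    merged: Params = {}
    for name, ref_values in reference.items():
        # Every model must expose the same parameter with the same size.
        for m in models:
            if name not in m or len(m[name]) != len(ref_values):
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights)) / total
            for i in range(len(ref_values))
        ]
    return merged


# Toy usage: two "fine-tuned" models with a single parameter vector each.
model_a = {"layer.weight": [1.0, 3.0]}
model_b = {"layer.weight": [3.0, 5.0]}
merged = merge_models([model_a, model_b], [1.0, 1.0])
```

Under this scheme every contributing model's parameters flow into the result, which gives some intuition for why a single uploaded model could influence the merged model.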