Xuda Zhi
2026
When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
Jiahe Guo | Xiangran Guo | Yulin Hu | Zimo Long | Xingyu Sui | Xuda Zhi | Yongbo Huang | Hao He | Weixiang Zhao | Yanyan Zhao | Bing Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by **15.8%–243.7%** relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from internal representation space, and propose a lightweight detection–reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. **WARNING:** This paper may contain harmful content.
On Safety Risks in Experience-Driven Self-Evolving Agents
Weixiang Zhao | Yichen Zhang | Yingshuo Wang | Yang Deng | Yanyan Zhao | Xuda Zhi | Yongbo Huang | Hao He | Wanxiang Che | Bing Qin | Ting Liu
Findings of the Association for Computational Linguistics: ACL 2026
Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents’ tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety–utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.