Xuda Zhi

2026

Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by **15.8%–243.7%** relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from internal representation space, and propose a lightweight detection–reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. **WARNING:** This paper may contain harmful content.

Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents’ tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety–utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.

Venues

ACL (1)
Findings (1)
