PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents
Hengyu An, Minxi Li, Naen Xu, Chunyi Zhou, Xiaogang Xu, Tianyu Du, Jinbao Li, Shouling Ji
Abstract
Self-evolving agents achieve personalization by accumulating user-specific memories over long horizons. This capability, however, introduces novel safety risks, as responses that are generally safe may become harmful in user-specific contexts. Such safety-relevant contexts often emerge implicitly and evolve over time during long-horizon conversations, rendering traditional context-independent safety evaluations insufficient. To address this, we formally define Implicit Personalized Safety and present PerMemSafe, the first benchmark for evaluating implicit personalized safety of self-evolving agents in long-horizon interactions. Empirical results reveal significant limitations of existing self-evolving agents, with even the strongest achieving only around 50% safety rate, highlighting systematic failures in reasoning about personalized safety risks. To mitigate this, we propose SentinelMem, an active risk-aware memory framework that explicitly models personalized risk inference and memory evolution. Experiments show that SentinelMem improves implicit personalized safety by 23.8% over prior memory frameworks while maintaining helpfulness in long-horizon interactions.
- Anthology ID: 2026.findings-acl.320
- Volume: Findings of the Association for Computational Linguistics: ACL 2026
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 6415–6433
- URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.320/
- Cite (ACL): Hengyu An, Minxi Li, Naen Xu, Chunyi Zhou, Xiaogang Xu, Tianyu Du, Jinbao Li, and Shouling Ji. 2026. PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6415–6433, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents (An et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.320.pdf