PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents

Hengyu An, Minxi Li, Naen Xu, Chunyi Zhou, Xiaogang Xu, Tianyu Du, Jinbao Li, Shouling Ji


Abstract
Self-evolving agents achieve personalization by accumulating user-specific memories over long horizons. This capability, however, introduces novel safety risks, because responses that are generally safe may become harmful in user-specific contexts. Such safety-relevant contexts often emerge implicitly and evolve over time during long-horizon conversations, rendering traditional context-independent safety evaluations insufficient. To address this gap, we formally define Implicit Personalized Safety and present PerMemSafe, the first benchmark for evaluating the implicit personalized safety of self-evolving agents in long-horizon interactions. Empirical results reveal significant limitations of existing self-evolving agents: even the strongest achieves a safety rate of only around 50%, highlighting systematic failures in reasoning about personalized safety risks. To mitigate these failures, we propose SentinelMem, an active risk-aware memory framework that explicitly models personalized risk inference and memory evolution. Experiments show that SentinelMem improves implicit personalized safety by 23.8% over prior memory frameworks while maintaining helpfulness in long-horizon interactions.
Anthology ID:
2026.findings-acl.320
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6415–6433
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.320/
Cite (ACL):
Hengyu An, Minxi Li, Naen Xu, Chunyi Zhou, Xiaogang Xu, Tianyu Du, Jinbao Li, and Shouling Ji. 2026. PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6415–6433, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents (An et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.320.pdf
Checklist:
 2026.findings-acl.320.checklist.pdf