Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Zvi Topol


Abstract
Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the “time-to-jailbreak” as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model demonstrates rapid degradation under iterative attacks, the other two models show consistent moderate vulnerability. Our framework provides actionable insights for model and LLM application developers and establishes survival analysis as a rigorous methodology for LLM safety evaluation.
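As a rough illustration of the "time-to-jailbreak" framing (a sketch under stated assumptions, not the paper's implementation), the snippet below computes a Kaplan-Meier survival curve in plain Python. It assumes each prompt is attacked repeatedly, that the event time is the attempt number at which the first jailbreak occurs, and that prompts never jailbroken within the attempt budget are right-censored. The function name `kaplan_meier` and the data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for each attempt number t with at least one jailbreak.

    durations: attempt number of the first jailbreak, or the last attempt
               tried if the prompt was never jailbroken (right-censored).
    observed:  True if a jailbreak occurred, False if censored.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)   # prompts leaving the risk set at t
    at_risk = len(durations)     # prompts still unbroken before attempt t
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d_t / n_t)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical data: 8 prompts, budget of 5 attempts each.
durations = [1, 2, 2, 3, 5, 5, 5, 5]
observed = [True, True, True, True, False, False, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"attempt {t}: estimated P(not yet jailbroken) = {s:.3f}")
```

Here S(t) estimates the probability that a prompt has still not been jailbroken after t attempts; a curve that drops steeply in the first few attempts would correspond to the "rapid degradation" profile the abstract describes.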
Anthology ID:
2026.trustnlp-main.5
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
64–72
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.5/
Cite (ACL):
Zvi Topol. 2026. Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 64–72, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis (Topol, TrustNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.5.pdf