@inproceedings{fatima-2025-proactive,
title = "A Proactive Reliability Metric for Detecting Failures in Language Model Training",
author = "Fatima, Maryam",
editor = "Potdar, Saloni and
Rojas-Barahona, Lina and
Montella, Sebastien",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
month = nov,
year = "2025",
    address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/dashboard/2025.emnlp-industry.193/",
pages = "2897--2913",
ISBN = "979-8-89176-333-3",
abstract = "Training large language models (LLMs) at scale is fraught with instabilities that can lead to catastrophic failures, wasting millions of dollars in compute resources. Current approaches rely on reactive interventions like checkpointing, which only mitigate failures after detection. We introduce the R-Metric, a proactive reliability metric that combines signals from hardware monitoring ($\lambda$), training dynamics ($\sigma^2$), and model performance ($\Delta L$) to predict failures before they occur. Through extensive experiments across 720 simulated runs and real-world validation on diverse hardware (NVIDIA T4/L4 GPUs) and model architectures (Llama 3.2-1B, GPT-2 Large, Qwen3-0.6B, Liquid AI LFM2-700M), we demonstrate that the R-Metric achieves 0.973 F1-Score in simulation and perfect 1.00 F1-Score in real-world deployment with an average lead time of 255 steps (12.8 minutes for small models, scaling to 2-8 minutes at production training speeds), enabling preemptive intervention. Importantly, our optimized weights ($\lambda$=0.10, $\sigma^2$=0.45, $\Delta L$=0.70) transfer across architectures with less than 3{\%} performance degradation, eliminating expensive retuning. The metric{'}s lightweight computational overhead (1.8{\%} training time increase) makes it immediately deployable for resource-constrained organizations{---}academic labs, startups, and open-source communities{---}democratizing access to enterprise-grade reliability monitoring."
}
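The abstract names three signals (hardware monitoring $\lambda$, training dynamics $\sigma^2$, model performance $\Delta L$) and their optimized weights (0.10, 0.45, 0.70), but not the exact combination rule. The sketch below is a hypothetical illustration of one plausible reading, a linear weighted combination of normalized risk signals; the normalization, the aggregation rule, and the alerting threshold are all assumptions, not the authors' method.

```python
# Hypothetical sketch of a weighted reliability score in the spirit of the
# R-Metric. Only the signal names and weights below come from the abstract;
# everything else (normalization to [0, 1], linear aggregation, threshold)
# is an illustrative assumption.

W_HW, W_VAR, W_LOSS = 0.10, 0.45, 0.70  # lambda, sigma^2, Delta-L weights from the abstract


def r_metric(hw_failure_rate: float, grad_variance: float, loss_delta: float) -> float:
    """Combine three risk signals into a single score.

    Each argument is assumed to be pre-normalized to [0, 1], with higher
    values indicating higher failure risk. The linear combination is a
    guess; the paper may aggregate the signals differently.
    """
    return W_HW * hw_failure_rate + W_VAR * grad_variance + W_LOSS * loss_delta


if __name__ == "__main__":
    # Example: low hardware risk, but unstable gradients and a worsening loss.
    score = r_metric(hw_failure_rate=0.05, grad_variance=0.6, loss_delta=0.8)
    print(f"R-Metric score: {score:.3f}")  # alert when the score crosses a tuned threshold
```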