Maryam Fatima


2025

A Proactive Reliability Metric for Detecting Failures in Language Model Training
Maryam Fatima
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Training large language models (LLMs) at scale is fraught with instabilities that can lead to catastrophic failures, wasting millions of dollars in compute resources. Current approaches rely on reactive interventions like checkpointing, which only mitigate failures after detection. We introduce the R-Metric, a proactive reliability metric that combines signals from hardware monitoring (λ), training dynamics (σ²), and model performance (ΔL) to predict failures before they occur. Through extensive experiments across 720 simulated runs and real-world validation on diverse hardware (NVIDIA T4/L4 GPUs) and model architectures (Llama 3.2-1B, GPT-2 Large, Qwen3-0.6B, Liquid AI LFM2-700M), we demonstrate that the R-Metric achieves a 0.973 F1 score in simulation and a perfect 1.00 F1 score in real-world deployment with an average lead time of 255 steps (12.8 minutes for small models, scaling to 2-8 minutes at production training speeds), enabling preemptive intervention. Importantly, our optimized weights (λ=0.10, σ²=0.45, ΔL=0.70) transfer across architectures with less than 3% performance degradation, eliminating expensive retuning. The metric's lightweight computational overhead (a 1.8% increase in training time) makes it immediately deployable for resource-constrained organizations—academic labs, startups, and open-source communities—democratizing access to enterprise-grade reliability monitoring.
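The abstract does not spell out how the three signals are combined, so the following is only a minimal sketch, assuming each signal is normalized to [0, 1] and combined linearly with the reported weights; the alert threshold and all function names are hypothetical, not the paper's implementation.

```python
# Minimal sketch of a composite reliability score in the spirit of the R-Metric.
# ASSUMPTIONS (not from the abstract): signals are pre-normalized to [0, 1], the
# combination is linear, and the 0.5 alert threshold is an arbitrary example.

def r_metric(hw_anomaly: float, grad_variance: float, loss_delta: float,
             w_lambda: float = 0.10, w_sigma2: float = 0.45, w_delta_l: float = 0.70) -> float:
    """Weighted combination of hardware (λ), training-dynamics (σ²), and loss (ΔL) signals."""
    return w_lambda * hw_anomaly + w_sigma2 * grad_variance + w_delta_l * loss_delta


def should_alert(score: float, threshold: float = 0.5) -> bool:
    """Flag a run for preemptive intervention (e.g., an early checkpoint) above the threshold."""
    return score >= threshold


# Example: moderate gradient variance plus a rising loss pushes the score past the threshold.
score = r_metric(hw_anomaly=0.1, grad_variance=0.6, loss_delta=0.5)
print(round(score, 2), should_alert(score))  # 0.63 True
```

Under such a scheme, the reported 255-step lead time would correspond to the composite score crossing its threshold well before the loss itself diverges, which is what makes the intervention preemptive rather than reactive.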

FIRMA: Bidirectional Formal-Informal Mathematical Language Alignment with Proof-Theoretic Grounding
Maryam Fatima
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

While large language models excel at generating plausible mathematical text, they often produce subtly incorrect formal translations that violate proof-theoretic constraints. We present FIRMA (Formal-Informal Reasoning in Mathematical Alignment), a bidirectional translation system between formal and informal mathematical language that leverages proof-theoretic interpretability hierarchies and specialized architectural components for proof preservation. Unlike existing approaches that treat this as pure sequence-to-sequence translation, FIRMA introduces a hierarchical architecture with complexity-aware routing, proof-preserving attention mechanisms, and multi-objective training that balances formal correctness with natural readability. Through progressive complexity training on curated datasets from Lean 4 and formal mathematics repositories, we evaluate FIRMA on 200 translation samples across complexity levels and compare against two baseline systems. Our analysis shows statistically significant improvements of 277.8% over BFS-Prover-V1-7B and 6307.5% over REAL-Prover on overall translation quality metrics. Ablation studies on 50 samples demonstrate that each architectural component contributes substantially to performance, with removal of any component resulting in 83-85% performance degradation. We release our code at https://github.com/smfatima3/FIRMA
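The abstract does not give the form of the multi-objective training loss, so the sketch below only illustrates the general idea of trading off formal correctness against readability; the convex combination, the weight alpha, and the term names are assumptions for illustration, not FIRMA's actual objective.

```python
# Minimal sketch of a two-term objective in the spirit of FIRMA's multi-objective training.
# ASSUMPTIONS (not from the abstract): the objective is a convex combination of a
# formal-correctness term and a readability term, and alpha = 0.7 is an arbitrary example.

def firma_style_loss(formal_correctness_loss: float, readability_loss: float,
                     alpha: float = 0.7) -> float:
    """Balance proof-preserving formal correctness against natural-language readability."""
    return alpha * formal_correctness_loss + (1.0 - alpha) * readability_loss


# Example: a formally sound but awkwardly phrased translation.
print(round(firma_style_loss(formal_correctness_loss=0.1, readability_loss=0.8), 2))  # 0.31
```

Raising alpha in a scheme like this would favor translations that preserve proof structure at the cost of fluency, while lowering it would favor readable but potentially less faithful output.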