AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

Huishuai Zhang, Bohan Wang, Luoxin Chen


Abstract
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the square root of a weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pretraining runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
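To make the abstract's description concrete, below is a minimal PyTorch sketch of an AdamS-style update as described above: a momentum buffer only (no second-moment buffer), with the denominator formed from the square root of a weighted sum of squares of the momentum and the current gradient. The function name `adams_step`, the use of AdamW-style decoupled weight decay, and the reuse of (beta1, beta2, eps) as the weighting coefficients are assumptions for illustration, not the paper's exact update rule; consult the PDF for the precise algorithm.

```python
import torch

def adams_step(param, grad, momentum, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One illustrative AdamS-style update on a single tensor (sketch only)."""
    # Decoupled weight decay, as in AdamW (assumed here).
    param.mul_(1 - lr * weight_decay)
    # Standard first-moment (momentum) update.
    momentum.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Denominator: square root of a weighted sum of squares of the momentum
    # and the current gradient; no second-moment buffer is maintained.
    # The (beta2, 1 - beta2) weighting is an assumption for this sketch.
    denom = torch.sqrt(beta2 * momentum**2 + (1 - beta2) * grad**2).add_(eps)
    # Normalized parameter update.
    param.addcdiv_(momentum, denom, value=-lr)
    return param, momentum
```

Note that the optimizer state is a single momentum buffer per parameter, which is what gives the method the memory footprint of SGD with momentum.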
Anthology ID:
2025.emnlp-main.543
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10730–10749
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.543/
Cite (ACL):
Huishuai Zhang, Bohan Wang, and Luoxin Chen. 2025. AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10730–10749, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.543.pdf
Checklist:
 2025.emnlp-main.543.checklist.pdf