Shenyang Deng

2026

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to 0.98 compared to Muon. We further show theoretically that HTMuon corresponds to steepest descent under a Schatten-q norm constraint, and we provide a convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
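The abstract states the Schatten-q connection without giving the update rule. As a rough illustration only, the sketch below computes the generic steepest-descent direction under a Schatten-q norm constraint using standard norm duality. It is not the authors' implementation. The function name `schatten_steepest_direction` and the use of a full SVD are assumptions made for this example. With q = infinity the direction reduces to the orthogonalized update the abstract describes, and with finite q the singular values of the update keep some of their spread.

```python
import numpy as np


def schatten_steepest_direction(grad, q):
    """Maximizer of <grad, X> over matrices X with Schatten-q norm <= 1.

    Hypothetical helper for illustration; requires q > 1 (q may be np.inf).
    """
    if not q > 1:
        raise ValueError("q must be greater than 1")
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # With 1/p + 1/q = 1, the optimal singular values are proportional
    # to s**(p - 1), and p - 1 = 1 / (q - 1). For q = inf the exponent is 0.
    power = 0.0 if np.isinf(q) else 1.0 / (q - 1.0)
    scaled = s ** power
    norm = np.linalg.norm(scaled, ord=q)
    if norm == 0.0:
        # Zero gradient: no descent direction to return.
        return np.zeros_like(grad)
    return (u * (scaled / norm)) @ vt


rng = np.random.default_rng(0)
g = rng.standard_normal((5, 3))
for q in (2.0, 4.0, np.inf):
    direction = schatten_steepest_direction(g, q)
    print(q, np.linalg.svd(direction, compute_uv=False))
```

In this sketch, q = 2 returns the gradient divided by its Frobenius norm, while q = infinity flattens every singular value to 1. Intermediate values of q interpolate between the two.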

Co-authors

Shuhua Yu 1

Venues

Findings 1
