Jungwoo Park
2025
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
Jungwoo Park | Taewhoo Lee | Chanwoong Yoon | Hyeon Hwang | Jaewoo Kang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce **Outlier-Safe Pre-Training (OSP)**, a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, producing the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves an average score of 35.7 across 10 benchmarks (versus 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment.
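A minimal PyTorch sketch of two ideas named in the abstract follows. It assumes "Single-Scale RMSNorm" means replacing the usual per-channel RMSNorm gain with one shared scalar, and it pairs this with an excess-kurtosis helper like the metric the abstract reports; the class name, parameterization, and helper are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    """RMSNorm variant with one shared scalar gain instead of a per-channel
    weight vector, so no individual channel can be selectively amplified."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(1))  # single learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the hidden dimension,
        # then apply the single shared gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * (x * rms)

def excess_kurtosis(x: torch.Tensor) -> torch.Tensor:
    """Excess kurtosis of flattened activations: near 0 for Gaussian-like
    activations, very large when a few channels carry extreme outliers."""
    x = x.float().flatten()
    mu, sigma = x.mean(), x.std(unbiased=False)
    return ((x - mu) / sigma).pow(4).mean() - 3.0
```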
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
Yein Park | Chanwoong Yoon | Jungwoo Park | Minbyul Jeong | Jaewoo Kang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While the ability of language models to recall facts has been widely investigated, how they handle temporally changing facts remains underexplored. Through circuit analysis, we discover Temporal Heads, specific attention heads that primarily handle temporal knowledge. We confirm that these heads are present across multiple models, though their specific locations may vary, and that their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model’s ability to recall time-specific knowledge while leaving its general capabilities, time-invariant knowledge, and question-answering performance intact. Moreover, the heads are activated not only by numeric conditions (“In 2004”) but also by textual aliases (“In the year ...”), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.
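The head-disabling experiment can be approximated with a simple zero-ablation hook. The sketch below is an assumption-laden illustration, not the paper's circuit-analysis tooling: it presumes the hooked attention module returns its concatenated per-head output as a single `(batch, seq_len, num_heads * head_dim)` tensor before the output projection, and the module path in the usage comment is hypothetical.

```python
import torch

def zero_ablate_head(attn_out: torch.Tensor, head_idx: int, num_heads: int) -> torch.Tensor:
    """Zero one head's slice of a concatenated multi-head attention output
    of shape (batch, seq_len, num_heads * head_dim)."""
    head_dim = attn_out.shape[-1] // num_heads
    out = attn_out.clone()
    out[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
    return out

def make_ablation_hook(head_idx: int, num_heads: int):
    """Forward hook whose return value replaces the module's output,
    assuming that output is the pre-projection attention tensor."""
    def hook(module, inputs, output):
        return zero_ablate_head(output, head_idx, num_heads)
    return hook

# Usage (hypothetical module path): evaluate time-specific recall with and
# without the hook registered, keeping time-invariant prompts as a control.
# handle = model.layers[18].attn.register_forward_hook(make_ablation_hook(7, 16))
# ...run evaluation...
# handle.remove()
```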
Co-authors
- Jaewoo Kang (2)
- Chanwoong Yoon (2)
- Hyeon Hwang (1)
- Minbyul Jeong (1)
- Taewhoo Lee (1)
Venues
- ACL (2)