Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
Keigo Shibata, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Wataru Ikeda, Jun Suzuki
Abstract
This paper examines the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors of the middle Transformer layers, despite a disproportionately large “jump” in angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, then demonstrate its prevalence across many open-weight models and its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG), which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of Llama-based models at three sizes, trained with the proposed JREG method, show improved task performance compared to the baseline, without altering the model architecture.
- Anthology ID:
- 2026.findings-eacl.64
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1236–1262
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.64/
- Cite (ACL):
- Keigo Shibata, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Wataru Ikeda, and Jun Suzuki. 2026. Suppressing Final Layer Hidden State Jumps in Transformer Pretraining. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1236–1262, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Suppressing Final Layer Hidden State Jumps in Transformer Pretraining (Shibata et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.64.pdf
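
The abstract describes measuring the angular distance between each layer's input and output hidden states and penalizing the disproportionate jump at the final layer. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the function names (`angular_distance`, `jreg_penalty`), the specific jump-strength definition (final-layer distance minus the mean of the preceding layers' distances), and the weighting coefficient `lambda_jreg` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def angular_distance(h_in: torch.Tensor, h_out: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Mean angular distance (arccos of cosine similarity) between two hidden-state tensors.

    h_in, h_out: (batch, seq_len, hidden_dim) hidden states before and after a layer.
    """
    cos = F.cosine_similarity(h_in, h_out, dim=-1).clamp(-1.0 + eps, 1.0 - eps)
    return torch.arccos(cos).mean()


def jreg_penalty(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical jump-suppression penalty (not the paper's exact metric).

    hidden_states: per-layer hidden states [h_0, ..., h_L], e.g. as returned by a
    Hugging Face model with output_hidden_states=True, where h_0 is the embedding
    output and h_L the final Transformer layer output. Penalizes how much the final
    layer's angular distance exceeds the mean distance of the earlier layers.
    """
    dists = torch.stack([
        angular_distance(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ])
    final_jump = dists[-1]
    earlier_mean = dists[:-1].mean()
    return torch.relu(final_jump - earlier_mean)


# Usage sketch (names are illustrative):
# outputs = model(input_ids, labels=labels, output_hidden_states=True)
# loss = outputs.loss + lambda_jreg * jreg_penalty(list(outputs.hidden_states))
```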