Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
Keigo Shibata, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Wataru Ikeda, Jun Suzuki
Abstract
This paper examines the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors of the middle Transformer layers, despite a disproportionately large “jump” in angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, then demonstrate its prevalence across many open-weight models and its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG), which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of Llama-based models at three sizes, trained with the proposed JREG method, show improved task performance compared to the baseline, without altering the model architecture.
- Anthology ID:
- 2026.findings-eacl.64
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1236–1262
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.64/
- Cite (ACL):
- Keigo Shibata, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Wataru Ikeda, and Jun Suzuki. 2026. Suppressing Final Layer Hidden State Jumps in Transformer Pretraining. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1236–1262, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Suppressing Final Layer Hidden State Jumps in Transformer Pretraining (Shibata et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.64.pdf
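
The abstract describes measuring the angular distance between each layer's input and output hidden states and penalizing the disproportionate jump at the final layer. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the function names (`angular_distance`, `jreg_penalty`), the specific jump-strength definition (final-layer distance minus the mean of the preceding layers' distances), and the weighting coefficient `lambda_jreg` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def angular_distance(h_in: torch.Tensor, h_out: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Mean angular distance (arccos of cosine similarity) between two hidden-state tensors.

    h_in, h_out: (batch, seq_len, hidden_dim) hidden states before and after a layer.
    """
    cos = F.cosine_similarity(h_in, h_out, dim=-1).clamp(-1.0 + eps, 1.0 - eps)
    return torch.arccos(cos).mean()


def jreg_penalty(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical jump-suppression penalty (not the paper's exact metric).

    hidden_states: per-layer hidden states [h_0, ..., h_L], e.g. as returned by a
    Hugging Face model with output_hidden_states=True, where h_0 is the embedding
    output and h_L the final Transformer layer output. Penalizes how much the final
    layer's angular distance exceeds the mean distance of the earlier layers.
    """
    dists = torch.stack([
        angular_distance(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ])
    final_jump = dists[-1]
    earlier_mean = dists[:-1].mean()
    return torch.relu(final_jump - earlier_mean)


# Usage sketch (names are illustrative):
# outputs = model(input_ids, labels=labels, output_hidden_states=True)
# loss = outputs.loss + lambda_jreg * jreg_penalty(list(outputs.hidden_states))
```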