Establishing a Scale for Kullback-Leibler Divergence in Language Models Across Various Settings
Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira
Abstract
Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space, as measured by the scaling behavior of KL divergence, are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
- Anthology ID:
- 2026.findings-acl.1163
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 23223–23248
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1163/
- Cite (ACL):
- Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, and Hidetoshi Shimodaira. 2026. Establishing a Scale for Kullback-Leibler Divergence in Language Models Across Various Settings. In Findings of the Association for Computational Linguistics: ACL 2026, pages 23223–23248, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Establishing a Scale for Kullback-Leibler Divergence in Language Models Across Various Settings (Kishino et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1163.pdf