Efficient Training of Language Models with Compact and Consistent Next Token Distributions

Ashutosh Sathe, Sunita Sarawagi


Abstract
Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed n-gram distribution. Previous studies have proposed corpus-level n-gram statistics as a regularizer; however, constructing and querying such n-grams naively is costly and significantly slows training, limiting their use in modern large language model pre-training. We introduce an alternative compact representation of the next-token distribution that, in expectation, matches the complete n-gram distribution while markedly reducing variance across mini-batches compared to the standard next-token loss. Empirically, we demonstrate that both the n-gram-regularized model and our approximation yield substantial improvements in model quality and convergence rate over existing methods. Furthermore, our approximation scales these gains to larger datasets and models more readily than straightforward n-gram regularization.
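The paper's exact construction is not reproduced on this page, so the following is only a minimal Python sketch of the general idea the abstract describes: building a corpus-level n-gram next-token distribution and using it as a soft target in a cross-entropy loss, in place of the usual one-hot next-token label. All names (`build_ngram_targets`, `soft_ce_loss`) and implementation details are assumptions for illustration, not the authors' method or their compact, low-variance representation.

```python
# Hypothetical sketch of corpus-level n-gram soft targets for next-token training.
# Function names and details are illustrative assumptions, not the paper's method.
from collections import Counter, defaultdict

import torch
import torch.nn.functional as F


def build_ngram_targets(token_ids, n=3):
    """Map each (n-1)-token context to its empirical next-token distribution."""
    counts = defaultdict(Counter)
    for i in range(len(token_ids) - n + 1):
        context = tuple(token_ids[i : i + n - 1])
        counts[context][token_ids[i + n - 1]] += 1
    return {
        ctx: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for ctx, ctr in counts.items()
    }


def soft_ce_loss(logits, contexts, ngram_targets, vocab_size):
    """Cross-entropy against the n-gram distribution instead of a one-hot label."""
    targets = torch.zeros(len(contexts), vocab_size)
    for row, ctx in enumerate(contexts):
        for tok, p in ngram_targets.get(ctx, {}).items():
            targets[row, tok] = p
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


# Toy usage with a tiny integer "corpus" and random logits.
corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3]
targets = build_ngram_targets(corpus, n=3)  # e.g. context (1, 2) -> {3: 2/3, 4: 1/3}
logits = torch.randn(1, 5)                  # one prediction over a vocab of size 5
loss = soft_ce_loss(logits, [(1, 2)], targets, vocab_size=5)
print(loss)
```

In this naive form the full n-gram table must be stored and queried at every step, which is exactly the cost the paper identifies; the proposed compact representation is designed to match this distribution in expectation while avoiding that overhead.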
Anthology ID:
2024.findings-acl.717
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12051–12064
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.717/
DOI:
10.18653/v1/2024.findings-acl.717
Cite (ACL):
Ashutosh Sathe and Sunita Sarawagi. 2024. Efficient Training of Language Models with Compact and Consistent Next Token Distributions. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12051–12064, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Efficient Training of Language Models with Compact and Consistent Next Token Distributions (Sathe & Sarawagi, Findings 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.717.pdf