Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale.

Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, Anna Rumshisky


Abstract
In recent years, language models have drastically grown in size, and the abilities of these models have been shown to improve with scale. The majority of recent scaling-law studies focused on high-compute, high-parameter-count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with the masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break of the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute cost (FLOPs) below 2.2 × 10^15 FLOPs. We also find that adding layers does not always benefit downstream performance. Our filtered pre-training data, reduced English vocabulary, and code are available at https://github.com/text-machine-lab/mini_bert
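As a rough illustration of the scaling relationship the abstract refers to (the functional form and symbols below follow the conventional compute-scaling setup; they are not constants reported in this paper), compute-optimal scaling is usually modeled as a power law of the pre-training loss L in training compute C:

    L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

where C_c and \alpha_C are fitted constants. The paper's observation is that a single fit of this form stops holding at small scale: below roughly 2.2 × 10^15 FLOPs, the MLM loss no longer decreases smoothly with compute.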
Anthology ID:
2023.findings-acl.326
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5298–5314
URL:
https://aclanthology.org/2023.findings-acl.326
DOI:
10.18653/v1/2023.findings-acl.326
Cite (ACL):
Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, and Anna Rumshisky. 2023. Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5298–5314, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale. (Deshpande et al., Findings 2023)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2023.findings-acl.326.pdf
Video:
https://preview.aclanthology.org/naacl-24-ws-corrections/2023.findings-acl.326.mp4