Scaling Parameter-Constrained Language Models with Quality Data
Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, Vikas Chandra
Abstract
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation, effective training tokens, which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens as a combination of two readily computed indicators of text: (i) text diversity and (ii) syntheticity, as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and accuracy scores on eight reasoning tasks. We demonstrate that the estimated constants yield a +0.83 Pearson correlation with true accuracies, and analyze them in scenarios involving widely used data techniques such as data sampling and synthesis that aim to improve data quality.
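To make the formulation concrete, below is a minimal sketch of how effective training tokens might plug into a Chinchilla-style scaling law. The multiplicative quality term, the weights `w_div` and `w_syn`, the fixed exponents, and all numbers are illustrative assumptions for this sketch, not the paper's fitted formulation; the paper estimates its constants against measured reasoning-task accuracies rather than this toy loss.

```python
# Minimal sketch (assumptions, not the paper's exact formulation):
# scale raw training tokens by a quality factor built from text diversity
# and syntheticity, then fit Chinchilla-style constants by least squares.
import numpy as np

def effective_tokens(d_tokens, diversity, syntheticity, w_div=1.0, w_syn=1.0):
    """Assumed multiplicative form: diverse, less-synthetic text counts
    as more 'effective' tokens. diversity/syntheticity lie in [0, 1]."""
    quality = (diversity ** w_div) * ((1.0 - syntheticity) ** w_syn)
    return d_tokens * quality

# Toy observations: (model size N, raw tokens D, quality scores) -> loss.
N = np.array([25e6, 60e6, 125e6, 350e6, 700e6, 1.5e9])
D = np.array([5e9, 10e9, 20e9, 40e9, 60e9, 100e9])
diversity = np.array([0.60, 0.72, 0.65, 0.80, 0.70, 0.78])
syntheticity = np.array([0.30, 0.20, 0.25, 0.10, 0.15, 0.12])
loss = np.array([3.85, 3.40, 3.10, 2.78, 2.62, 2.41])

d_eff = effective_tokens(D, diversity, syntheticity)

# Fix the exponents (Chinchilla-like guesses) and solve the remaining
# linear problem  loss ~= E + A * N**-alpha + B * d_eff**-beta.
alpha, beta = 0.34, 0.28
X = np.stack([np.ones_like(N), N ** -alpha, d_eff ** -beta], axis=1)
(E, A, B), *_ = np.linalg.lstsq(X, loss, rcond=None)
print(f"E={E:.3f}  A={A:.1f}  B={B:.1f}")
```

In the paper itself, the observations come from over 200 pretrained models, and the fitted constants are validated by a +0.83 Pearson correlation with measured accuracies on eight reasoning tasks.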
- Anthology ID:
- 2024.emnlp-industry.8
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 80–97
- URL:
- https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2024.emnlp-industry.8/
- DOI:
- 10.18653/v1/2024.emnlp-industry.8
- Cite (ACL):
- Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, and Vikas Chandra. 2024. Scaling Parameter-Constrained Language Models with Quality Data. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 80–97, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Scaling Parameter-Constrained Language Models with Quality Data (Chang et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2024.emnlp-industry.8.pdf