Abstract
This paper reports on experiments aimed at improving our understanding of how much data is required for training attention-based transformer language models. Specifically, we investigate the impact of reducing the immense amounts of required pre-training data through sampling strategies that identify and downsample high-frequency tokens, since different studies have indicated that very high-frequency tokens in pre-training data may bias learning and cause undesired effects. In this light, we describe a sampling algorithm that iteratively assesses token frequencies and removes sentences that still contain high-frequency tokens, eventually delivering a balanced, linguistically correct dataset. We evaluate the results in terms of model perplexity and fine-tuning on linguistic probing tasks, NLP downstream tasks, and the more semantically oriented SuperGLUE tasks. The results show that pre-training with the resulting balanced dataset allows reducing the pre-training data by up to a factor of three.
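The abstract only sketches the sampling procedure; the snippet below is a minimal illustrative sketch of such an iterative frequency-based filter, not the paper's actual algorithm. It assumes whitespace tokenization, and the parameters `freq_threshold` (relative frequency above which a token is treated as over-represented) and `max_iters` are hypothetical, not the authors' settings.

```python
from collections import Counter

def frequency_balance(sentences, freq_threshold=1e-3, max_iters=10):
    """Iteratively drop sentences containing over-represented tokens.

    Illustrative sketch only: thresholds, tokenization, and stopping
    criteria are assumptions, not the paper's reported configuration.
    """
    kept = list(sentences)
    for _ in range(max_iters):
        # Recompute token frequencies over the currently kept sentences.
        counts = Counter(tok for sent in kept for tok in sent.split())
        total = sum(counts.values())
        high_freq = {tok for tok, c in counts.items()
                     if c / total > freq_threshold}
        if not high_freq:
            break  # distribution is balanced enough; stop iterating
        # Remove sentences that still contain any high-frequency token.
        filtered = [s for s in kept
                    if not any(t in high_freq for t in s.split())]
        if not filtered or len(filtered) == len(kept):
            break  # avoid emptying the corpus or looping without progress
        kept = filtered
    return kept
```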
- Anthology ID:
- 2023.findings-emnlp.527
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7859–7872
- URL:
- https://aclanthology.org/2023.findings-emnlp.527
- DOI:
- 10.18653/v1/2023.findings-emnlp.527
- Cite (ACL):
- Rodolfo Zevallos, Mireia Farrús, and Núria Bel. 2023. Frequency Balanced Datasets Lead to Better Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7859–7872, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Frequency Balanced Datasets Lead to Better Language Models (Zevallos et al., Findings 2023)
- PDF:
- https://aclanthology.org/2023.findings-emnlp.527.pdf