Sequence Reducible Holdout Loss for Language Model Pretraining
Raghuveer Thirukovalluru, Nicholas Monath, Bhuwan Dhingra, Sam Wiseman
Abstract
Data selection techniques, which adaptively select datapoints inside the training loop, have demonstrated empirical benefits in reducing the number of gradient steps needed to train neural models. However, these techniques have so far largely been applied to classification. In this work, we study their applicability to language model pretraining, a highly time-intensive task. We propose a simple modification to an existing data selection technique (reducible hold-out loss training) in order to adapt it to the sequence losses typical in language modeling. We experiment on both autoregressive and masked language modeling, and show that applying data selection to pretraining offers notable benefits, including a 4.3% reduction in the total number of steps and a 21.5% average reduction in steps to an intermediate target perplexity over the course of pretraining an autoregressive language model. Further, language models trained with data selection demonstrate significantly better generalization on out-of-domain datasets: a 7.9% reduction in the total number of steps and a 23.2% average reduction in steps to an intermediate target perplexity.
- Anthology ID:
- 2024.lrec-main.1281
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 14705–14716
- URL:
- https://aclanthology.org/2024.lrec-main.1281
- Cite (ACL):
- Raghuveer Thirukovalluru, Nicholas Monath, Bhuwan Dhingra, and Sam Wiseman. 2024. Sequence Reducible Holdout Loss for Language Model Pretraining. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14705–14716, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Sequence Reducible Holdout Loss for Language Model Pretraining (Thirukovalluru et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.1281.pdf
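The abstract describes adapting reducible hold-out loss (RHO-loss) selection, originally defined for per-example classification losses, to sequence losses in language model pretraining. The sketch below is a minimal, hypothetical PyTorch illustration of this idea, not the paper's exact formulation: it assumes HuggingFace-style causal LM models, scores each sequence by the mean per-token reducible loss (training-model loss minus holdout-model loss), and keeps the top-scoring fraction of each batch. The names `train_model`, `holdout_model`, and `keep_fraction` are illustrative.

```python
import torch
import torch.nn.functional as F


def per_sequence_loss(model, input_ids, attention_mask):
    """Mean per-token cross-entropy for each sequence in the batch."""
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]          # predict token t+1 from its prefix
    labels = input_ids[:, 1:]
    mask = attention_mask[:, 1:].float()
    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.size())
    # Average over real (non-padding) tokens only.
    return (token_loss * mask).sum(dim=1) / mask.sum(dim=1)


@torch.no_grad()
def select_by_sequence_rho(train_model, holdout_model, batch, keep_fraction=0.5):
    """Keep the sequences with the largest sequence-level reducible holdout loss,
    i.e. L_train(x) - L_holdout(x), aggregated here as a mean over tokens
    (an assumption; the paper's aggregation may differ)."""
    input_ids, attention_mask = batch["input_ids"], batch["attention_mask"]
    l_train = per_sequence_loss(train_model, input_ids, attention_mask)
    l_holdout = per_sequence_loss(holdout_model, input_ids, attention_mask)
    rho = l_train - l_holdout
    k = max(1, int(keep_fraction * input_ids.size(0)))
    keep = torch.topk(rho, k).indices
    return {key: value[keep] for key, value in batch.items()}
```

In a training loop, the selected sub-batch would then be passed through `train_model` again with gradients enabled for the actual update. Averaging the per-token reducible losses (rather than summing) avoids biasing selection toward longer sequences; this is one plausible design choice for extending RHO-loss to sequences, offered here only as an assumption.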