Automatic Document Selection for Efficient Encoder Pretraining

Yukun Feng, Patrick Xia, Benjamin Van Durme, João Sedoc


Abstract
Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target-domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. Both on perplexity and across several downstream tasks in the target domain, the resulting model consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
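For readers unfamiliar with the underlying criterion, the sketch below illustrates greedy cynical selection with a unigram model: each candidate is scored by how much adding it would lower the cross-entropy of the representative (target-domain) corpus under a model of the data selected so far. This is a minimal illustration of the general criterion (Axelrod, 2017), not the authors' implementation; the function names, the unigram model, the sentence-level granularity, and the smoothing constant eps are assumptions made for this example, whereas the paper extends the method and applies it at the document level over the Pile.

```python
# Illustrative sketch of greedy cynical data selection (after Axelrod, 2017).
# Not the released code for this paper; names and smoothing are assumptions.
import math
from collections import Counter

def unigram_probs(corpus):
    """Relative word frequencies of the representative (target-domain) corpus."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def delta_entropy(sent, rep_probs, sel_counts, sel_total, eps=1e-3):
    """Change in cross-entropy of the representative corpus if `sent` is added
    to the current selection (penalty + gain); more negative is better."""
    words = sent.split()
    n = len(words)
    if n == 0:
        return float("inf")
    # Penalty: the selection grows, diluting every existing count.
    penalty = math.log((sel_total + n) / max(sel_total, 1))
    # Gain: words shared with the representative corpus become more probable.
    gain = 0.0
    for w, c in Counter(words).items():
        old = sel_counts.get(w, 0) + eps  # smooth words unseen in the selection
        gain -= rep_probs.get(w, 0.0) * math.log((old + c) / old)
    return penalty + gain

def cynical_select(pool, rep_corpus, budget):
    """Greedily pick `budget` sentences from `pool` that most reduce the
    cross-entropy of `rep_corpus` under a unigram model of the selection."""
    rep_probs = unigram_probs(rep_corpus)
    sel_counts, sel_total, selected = Counter(), 0, []
    remaining = list(pool)
    for _ in range(min(budget, len(remaining))):
        scores = [delta_entropy(s, rep_probs, sel_counts, sel_total) for s in remaining]
        best = min(range(len(remaining)), key=scores.__getitem__)
        sent = remaining.pop(best)
        selected.append(sent)
        sel_counts.update(sent.split())
        sel_total += len(sent.split())
    return selected
```

Rescoring every remaining candidate on each round is quadratic in the pool size; a selection run over a corpus the size of the Pile would need incremental or batched scoring, which is omitted here for clarity.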
Anthology ID: 2022.emnlp-main.647
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9522–9530
URL: https://aclanthology.org/2022.emnlp-main.647
DOI: 10.18653/v1/2022.emnlp-main.647
Cite (ACL): Yukun Feng, Patrick Xia, Benjamin Van Durme, and João Sedoc. 2022. Automatic Document Selection for Efficient Encoder Pretraining. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9522–9530, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Automatic Document Selection for Efficient Encoder Pretraining (Feng et al., EMNLP 2022)
PDF: https://preview.aclanthology.org/add_acl24_videos/2022.emnlp-main.647.pdf