Abstract
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpus has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

- Anthology ID: 2023.findings-eacl.146
- Volume: Findings of the Association for Computational Linguistics: EACL 2023
- Month: May
- Year: 2023
- Address: Dubrovnik, Croatia
- Editors: Andreas Vlachos, Isabelle Augenstein
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1954–1974
- URL: https://aclanthology.org/2023.findings-eacl.146
- DOI: 10.18653/v1/2023.findings-eacl.146
- Cite (ACL): David Samuel, Andrey Kutuzov, Lilja Øvrelid, and Erik Velldal. 2023. Trained on 100 million words and still in shape: BERT meets British National Corpus. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1954–1974, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal): Trained on 100 million words and still in shape: BERT meets British National Corpus (Samuel et al., Findings 2023)
- PDF: https://preview.aclanthology.org/naacl24-info/2023.findings-eacl.146.pdf