Training compute-optimal transformer encoder models
Megi Dervishi, Alexandre Allauzen, Gabriel Synnaeve, Yann LeCun
Abstract
Transformer encoders are critical for a wide range of Natural Language Processing (NLP) tasks, yet their compute efficiency remains poorly understood. We present the first comprehensive empirical investigation of compute-optimal pretraining for encoder transformers using the Masked Language Modeling (MLM) objective. Across hundreds of carefully controlled runs we vary model size, data size, batch size, learning rate, and masking ratio over increasing compute budgets. We find that the compute-optimal data-to-model ratio of transformer encoder models is 10 to 100 times larger than that of auto-regressive models. Using these recipes, we train OptiBERT, a family of compute-optimal BERT-style models that matches or surpasses leading baselines, including ModernBERT and NeoBERT, on GLUE and MTEB while training with dramatically fewer FLOPs.
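The headline finding is easiest to see with a back-of-the-envelope allocation. The sketch below is not from the paper; it assumes the standard C ≈ 6ND approximation for transformer training FLOPs and uses placeholder data-to-model ratios (roughly 20 tokens per parameter for auto-regressive decoders, and 10 to 100 times that for encoders, consistent with the reported finding) purely to illustrate how a larger compute-optimal ratio shifts a fixed FLOP budget away from parameters and toward data.

```python
# Illustrative sketch (not the paper's code): given a FLOP budget C and the
# common C ~ 6 * N * D approximation for transformer training cost, compute
# the parameter count N and token count D implied by a target
# data-to-model ratio r = D / N. The ratio values below are placeholders;
# the paper reports only that the encoder-optimal ratio is 10-100x the
# auto-regressive one.

def optimal_allocation(flop_budget: float, tokens_per_param: float):
    """Return (params, tokens) that exhaust `flop_budget` at the given ratio.

    Solves C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r)).
    """
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 1e21  # FLOPs

    # ~20 tokens/param is the commonly cited Chinchilla-style ratio for
    # auto-regressive decoders; 200 and 2000 are hypothetical encoder values
    # spanning the reported "10 to 100 times larger" range.
    for label, ratio in [("decoder (~Chinchilla)", 20),
                         ("encoder (10x)", 200),
                         ("encoder (100x)", 2000)]:
        n, d = optimal_allocation(budget, ratio)
        print(f"{label:>22}: N ~ {n / 1e6:8.1f}M params, D ~ {d / 1e9:8.1f}B tokens")
```

At the same budget, the hypothetical encoder ratios produce much smaller models trained on far more tokens, which is the intuition behind the paper's compute-optimal recipes.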
- Anthology ID:
- 2025.emnlp-main.1804
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 35602–35617
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1804/
- Cite (ACL):
- Megi Dervishi, Alexandre Allauzen, Gabriel Synnaeve, and Yann LeCun. 2025. Training compute-optimal transformer encoder models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35602–35617, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Training compute-optimal transformer encoder models (Dervishi et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1804.pdf