Training compute-optimal transformer encoder models

Megi Dervishi, Alexandre Allauzen, Gabriel Synnaeve, Yann LeCun


Abstract
Transformer encoders are critical for a wide range of Natural Language Processing (NLP) tasks, yet their compute efficiency remains poorly understood. We present the first comprehensive empirical investigation of compute-optimal pretraining for encoder transformers under the Masked Language Modeling (MLM) objective. Across hundreds of carefully controlled runs, we vary model size, data size, batch size, learning rate, and masking ratio at increasing compute budgets. We find that the compute-optimal data-to-model ratio of transformer encoders is 10 to 100 times larger than that of auto-regressive models. Using these recipes, we train OptiBERT, a family of compute-optimal BERT-style models that matches or surpasses leading baselines, including ModernBERT and NeoBERT, on GLUE and MTEB while training with dramatically fewer FLOPs.
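The abstract's central quantity is the compute-optimal data-to-model ratio D/N. The sketch below is only an illustration of how that ratio interacts with a fixed FLOPs budget: it assumes the common C ≈ 6·N·D training-FLOPs approximation and a Chinchilla-style ~20 tokens-per-parameter ratio for auto-regressive decoders, and picks an arbitrary 50x multiplier within the abstract's qualitative 10-100x range for encoders. None of these constants are results reported in the paper.

```python
# Illustrative sketch only: the 6*N*D FLOPs rule of thumb and the ~20
# tokens-per-parameter decoder ratio are standard scaling-law approximations,
# NOT numbers from this paper. The 50x encoder multiplier is an arbitrary
# point inside the abstract's qualitative 10-100x range.

def compute_optimal_allocation(budget_flops: float, tokens_per_param: float):
    """Given a FLOPs budget C and a target data-to-model ratio r = D/N,
    solve C = 6 * N * D with D = r * N for model size N and token count D."""
    # C = 6 * r * N^2  =>  N = sqrt(C / (6 * r)),  D = r * N
    n_params = (budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 1e21                 # hypothetical pretraining budget in FLOPs
    decoder_ratio = 20.0          # Chinchilla-style D/N for decoders (assumption)
    encoder_ratio = 20.0 * 50     # 50x larger, chosen arbitrarily within 10-100x

    for name, ratio in [("decoder", decoder_ratio), ("encoder", encoder_ratio)]:
        n, d = compute_optimal_allocation(budget, ratio)
        print(f"{name}: ~{n / 1e6:.0f}M params, ~{d / 1e9:.0f}B tokens at D/N = {ratio:g}")
```

Under these assumptions, the same budget shifts the encoder optimum toward a much smaller model trained on many more tokens than the decoder optimum, which is the qualitative effect of a 10-100x larger data-to-model ratio.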
Anthology ID:
2025.emnlp-main.1804
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
35602–35617
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1804/
Cite (ACL):
Megi Dervishi, Alexandre Allauzen, Gabriel Synnaeve, and Yann LeCun. 2025. Training compute-optimal transformer encoder models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35602–35617, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Training compute-optimal transformer encoder models (Dervishi et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1804.pdf
Checklist:
 2025.emnlp-main.1804.checklist.pdf