A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary
Abstract
Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on pre-training. We carry one out in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms, and we use the best one to pre-train several ModernBERT models on French with a fixed compute budget. We fine-tune and evaluate them on a variety of French benchmarks. We compare them with models pre-trained on randomly sampled data of commensurate size, with the same compute budget. We find that both random and diversity-driven sampling may reduce the pre-training dataset by up to 94% and the pre-training time by up to 73% while maintaining performance. Moreover, in some tasks, the inherent quality of models, estimated via head-only fine-tuning, is up to 10 points higher with diversity sampling than with random sampling.
- Anthology ID:
- 2026.findings-acl.1707
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 34168–34181
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1707/
- Cite (ACL):
- Louis Estève, Christophe Servan, Thomas Lavergne, and Agata Savary. 2026. A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34168–34181, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT (Estève et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1707.pdf