A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT

Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary


Abstract
Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on pre-training. We carry out such an investigation in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms, and we use the best one to pre-train several ModernBERT models on French with a fixed compute budget. We fine-tune and evaluate them on a variety of French benchmarks. We compare them with models pre-trained on randomly sampled data of commensurate size, with the same compute budget. We find that both random and diversity-driven sampling may reduce the pre-training dataset by up to 94% and the pre-training time by up to 73% while maintaining performance. Moreover, in some tasks, the inherent quality of models, estimated via head-only fine-tuning, is up to 10 points higher with diversity sampling than with random sampling.
Anthology ID:
2026.findings-acl.1707
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
34168–34181
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1707/
Cite (ACL):
Louis Estève, Christophe Servan, Thomas Lavergne, and Agata Savary. 2026. A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34168–34181, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT (Estève et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1707.pdf
Checklist:
 2026.findings-acl.1707.checklist.pdf