Christophe Servan
2026
A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève | Christophe Servan | Thomas Lavergne | Agata Savary
Findings of the Association for Computational Linguistics: ACL 2026
Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on pre-training. We do so in this study, with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms, and we use the best one to pre-train several ModernBERT models on French with a fixed compute budget. We fine-tune and evaluate them on a variety of French benchmarks. We compare them with models pre-trained on randomly sampled data of commensurate size, with the same compute budget. We find that both random and diversity-driven sampling may reduce the pre-training dataset by up to 94% and the pre-training time by up to 73% while maintaining performance. Moreover, in some tasks, the inherent quality of models, estimated via head-only fine-tuning, is up to 10 points higher with diversity sampling than with random sampling.