Joseph Jennings
2025
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
Dan Su | Kezhi Kong | Ying Lin | Joseph Jennings | Brandon Norick | Markus Kliegl | Mostofa Patwary | Mohammad Shoeybi | Bryan Catanzaro
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html.
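To make the classifier-ensembling idea concrete, below is a minimal Python sketch, not the actual Nemotron-CC pipeline: it assumes hypothetical quality classifiers that each return a score in [0, 1], combines them by taking the maximum, and buckets documents by the ensembled score. The classifier functions, bucket names, and thresholds are illustrative placeholders.

```python
# Minimal sketch of classifier ensembling for quality bucketing.
# The classifiers and thresholds below are hypothetical stand-ins,
# not the Nemotron-CC models or settings.
from typing import Callable, Dict, List


def ensemble_quality_score(text: str, classifiers: List[Callable[[str], float]]) -> float:
    """Score a document with each classifier (assumed to return a value in [0, 1])
    and take the maximum, so a document rated highly by any one classifier is kept."""
    return max(clf(text) for clf in classifiers)


def bucket_documents(
    docs: List[str],
    classifiers: List[Callable[[str], float]],
    thresholds: Dict[str, float],
) -> Dict[str, List[str]]:
    """Assign each document to the highest quality bucket whose threshold it meets."""
    # Sort buckets from highest to lowest threshold.
    ordered = sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True)
    buckets: Dict[str, List[str]] = {name: [] for name, _ in ordered}
    buckets["discard"] = []
    for doc in docs:
        score = ensemble_quality_score(doc, classifiers)
        for name, cutoff in ordered:
            if score >= cutoff:
                buckets[name].append(doc)
                break
        else:
            buckets["discard"].append(doc)
    return buckets


if __name__ == "__main__":
    # Toy stand-in classifiers; real ones would be trained quality models.
    classifiers = [
        lambda t: min(len(t.split()) / 100.0, 1.0),        # favors longer documents
        lambda t: 1.0 if "theorem" in t.lower() else 0.2,  # favors educational text
    ]
    docs = ["Theorem 1 states that ...", "buy now cheap deals"]
    print(bucket_documents(docs, classifiers, {"high": 0.8, "medium": 0.4}))
```

A max-style combination like this illustrates the accuracy-versus-quantity trade-off the abstract describes: a document survives if any one classifier rates it highly, which retains more unique tokens than a single aggressive filter would.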
2024
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar | Shrimai Prabhumoye | Joseph Jennings | Bo Liu | Aastha Jhunjhunwala | Zhilin Wang | Mostofa Patwary | Mohammad Shoeybi | Bryan Catanzaro
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high-quality pretraining sets.
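As a rough illustration of the attribute-based refinement described above, here is a small Python sketch (not the paper's actual tooling): documents are tagged with toxicity, quality, type-of-speech, and domain attributes by user-supplied annotators, and simple threshold rules decide what to keep. All annotator functions, labels, and thresholds are hypothetical.

```python
# Sketch of attribute annotation and filtering for web-crawl documents.
# The annotators and rules are illustrative placeholders, not the paper's classifiers.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AnnotatedDoc:
    text: str
    toxicity: float    # 0 (benign) .. 1 (toxic)
    quality: float     # 0 (low) .. 1 (high)
    speech_type: str   # e.g. "news", "dialogue", "boilerplate"
    domain: str        # e.g. "science", "sports", "shopping"


def annotate(
    text: str,
    toxicity_fn: Callable[[str], float],
    quality_fn: Callable[[str], float],
    speech_fn: Callable[[str], str],
    domain_fn: Callable[[str], str],
) -> AnnotatedDoc:
    """Run each attribute annotator over the document."""
    return AnnotatedDoc(text, toxicity_fn(text), quality_fn(text), speech_fn(text), domain_fn(text))


def refine(docs: Iterable[AnnotatedDoc]) -> List[AnnotatedDoc]:
    """Keep documents that pass simple attribute rules (illustrative thresholds)."""
    return [
        d for d in docs
        if d.toxicity < 0.2 and d.quality > 0.5 and d.speech_type != "boilerplate"
    ]


if __name__ == "__main__":
    # Toy annotators returning fixed values; real ones would be trained models.
    docs = [
        annotate(
            "The mitochondria is the powerhouse of the cell.",
            toxicity_fn=lambda t: 0.0,
            quality_fn=lambda t: 0.9,
            speech_fn=lambda t: "educational",
            domain_fn=lambda t: "science",
        )
    ]
    print([d.domain for d in refine(docs)])
```

Keeping the attributes as explicit fields, rather than discarding documents outright, leaves room to reweight or sample by domain and speech type later, in the spirit of the refinement step the abstract mentions.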