Jiahui Peng
2025
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
Tianyi Bai | Ling Yang | Zhen Hao Wong | Fupeng Sun | Xinlin Zhuang | Jiahui Peng | Chi Zhang | Lijun Wu | Qiu Jiantao | Wentao Zhang | Binhang Yuan | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism. Each data selection method independently prioritizes data based on its specific criterion and updates its prioritization rules using the current state of the model, functioning as an independent actor for data selection. Additionally, a console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate our multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to 10.5% across multiple language model benchmarks compared to state-of-the-art methods.
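The abstract's actor/console split can be pictured with a minimal sketch: each actor scores a candidate batch by its own criterion, and a console blends those scores with weights it re-estimates as training progresses. This is an illustration of the idea only, not the authors' released implementation; the class names, the softmax re-weighting, and the notion of "actor gains" are all assumptions introduced here.

```python
# Illustrative sketch (not the paper's code): actors score data by their
# own criteria; a console mixes the scores with dynamically updated weights.
from typing import Callable, Sequence
import numpy as np

class Actor:
    """Wraps one data-selection criterion, e.g. a quality or diversity scorer."""
    def __init__(self, score_fn: Callable[[str], float]):
        self.score_fn = score_fn

    def prioritize(self, batch: Sequence[str]) -> np.ndarray:
        return np.array([self.score_fn(x) for x in batch])

class Console:
    """Maintains per-actor weights and mixes actor scores into one ranking."""
    def __init__(self, n_actors: int):
        self.weights = np.full(n_actors, 1.0 / n_actors)

    def update_weights(self, actor_gains: np.ndarray) -> None:
        # Hypothetical update rule: re-weight actors by how much each
        # criterion recently helped (e.g. correlation with validation-loss
        # improvement); softmax keeps weights positive and normalized.
        exp = np.exp(actor_gains - actor_gains.max())
        self.weights = exp / exp.sum()

    def select(self, batch: Sequence[str], actors: Sequence[Actor], k: int):
        scores = np.stack([a.prioritize(batch) for a in actors])  # (n_actors, n)
        combined = self.weights @ scores                          # (n,)
        top = np.argsort(combined)[::-1][:k]
        return [batch[i] for i in top]
```

Calling `console.update_weights(...)` between pretraining stages, then `console.select(...)` on each candidate pool, matches the described loop of stage-dependent influence adjustment followed by collaborative selection.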
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang | Jiahui Peng | Ren Ma | Yinfan Wang | Tianyi Bai | Xingjian Wei | Qiu Jiantao | Chi Zhang | Ying Qian | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality—a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23%, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
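The weighting step the abstract describes, regressing validation loss on a vector of quality scores from proxy-model runs, can be sketched as below. This is a minimal illustration under stated assumptions, not the released Meta-rater code: the synthetic data, the choice of ordinary least squares, and all variable names are introduced here for exposition.

```python
# Minimal sketch: learn per-dimension weights by regressing proxy-run
# validation loss on quality scores, then combine scores with the
# learned weights. Data below is synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# 32 hypothetical proxy runs, each scored on 4 dimensions
# (professionalism, readability, reasoning, cleanliness).
quality_scores = rng.uniform(0.0, 1.0, size=(32, 4))
val_loss = (3.0
            - quality_scores @ np.array([0.4, 0.2, 0.3, 0.1])
            + rng.normal(0.0, 0.02, size=32))

# Regress validation loss on the quality-score vectors.
reg = LinearRegression().fit(quality_scores, val_loss)

# Dimensions with the most negative coefficients reduce loss the most,
# so their negated coefficients act as selection weights.
weights = -reg.coef_
combined_score = quality_scores @ weights  # rate candidate data by one scalar
print(weights)
```

Documents (or data mixes) with the highest combined score would then be preferred for pretraining, which is the "learned optimal weightings" mechanism the abstract outlines.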