Rethinking Data Selection at Scale: Random Selection is Almost All You Need
Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
Abstract
Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning with this subset achieves results comparable to or even exceeding those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools, which fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods—those that do not rely on external model assistance—on two million-scale datasets, and found that nearly all methods struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high-quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient method for improving results. This approach, particularly when training on long-text data, proves highly beneficial for relatively weaker base models, such as Llama3. The code is available at https://github.com/xiatingyu/SFT-DataSelection-at-scale.
- Anthology ID: 2025.findings-emnlp.146
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 2698–2711
- URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.146/
- DOI: 10.18653/v1/2025.findings-emnlp.146
- Cite (ACL): Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, and Junyang Lin. 2025. Rethinking Data Selection at Scale: Random Selection is Almost All You Need. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2698–2711, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Rethinking Data Selection at Scale: Random Selection is Almost All You Need (Xia et al., Findings 2025)
- PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.146.pdf
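
The abstract's two baselines — uniform random sampling and filtering the SFT pool by token length — can be illustrated with a minimal sketch. This is not the authors' released implementation (see the repository linked in the abstract); the field names `instruction`/`response` and the whitespace-based token count are illustrative assumptions, and a real setup would use the target model's tokenizer.

```python
# Minimal sketch (assumed interfaces, not the paper's code) of two SFT data
# selection baselines: uniform random sampling and token-length filtering.
# Token counts are approximated by whitespace splitting for simplicity.
import random

def random_select(pool, k, seed=42):
    """Select k examples uniformly at random from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, k)

def length_select(pool, k):
    """Select the k examples with the longest instruction + response text."""
    def approx_tokens(example):
        return len((example["instruction"] + " " + example["response"]).split())
    return sorted(pool, key=approx_tokens, reverse=True)[:k]

if __name__ == "__main__":
    # Toy pool; in practice this would be a million-scale instruction dataset.
    pool = [
        {"instruction": f"question {i}", "response": "answer " * (i % 50)}
        for i in range(1000)
    ]
    print(len(random_select(pool, 100)), len(length_select(pool, 100)))
```

Both selectors return a fixed-size subset, so they can be compared directly by fine-tuning on each subset and evaluating the resulting models, which is the kind of comparison the paper reports at million scale.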