Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin


Abstract
Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning on this subset achieves results comparable to or even exceeding those obtained with the entire dataset. However, most existing data selection techniques are designed for small-scale data pools and fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods (those that do not rely on external model assistance) on two million-scale datasets, and found that nearly all of them struggled to significantly outperform random selection on such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high-quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient way to improve results. This approach is particularly beneficial when training relatively weaker base models, such as Llama3, on long-text data. The code is available at https://github.com/xiatingyu/SFT-DataSelection-at-scale.
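
The token-length filter mentioned in the abstract can be sketched as below. This is a minimal illustrative sketch, not the authors' released implementation (see the GitHub repository for that); the field names ("instruction", "output"), the tokenizer checkpoint, and the subset size k are assumptions made for the example.

from transformers import AutoTokenizer

def select_by_token_length(examples, tokenizer_name, k=50_000):
    # Rank SFT examples by token count and keep the k longest ones.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def num_tokens(example):
        # Count tokens over the full instruction-response pair (assumed field names).
        text = example["instruction"] + "\n" + example["output"]
        return len(tokenizer(text, add_special_tokens=False)["input_ids"])

    return sorted(examples, key=num_tokens, reverse=True)[:k]

# Example usage (hypothetical data pool and checkpoint name):
# subset = select_by_token_length(full_pool, "meta-llama/Meta-Llama-3-8B", k=50_000)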
Anthology ID:
2025.findings-emnlp.146
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2698–2711
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.146/
DOI:
10.18653/v1/2025.findings-emnlp.146
Cite (ACL):
Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, and Junyang Lin. 2025. Rethinking Data Selection at Scale: Random Selection is Almost All You Need. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2698–2711, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Rethinking Data Selection at Scale: Random Selection is Almost All You Need (Xia et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.146.pdf
Checklist:
2025.findings-emnlp.146.checklist.pdf