SELECting over Tokens: Curating Pre-training Data at Scale via Token Classification

Xin Tong; Weidong Zhang; Jiaang Li; Haibin Chen; Shilei Liu; Langming Liu; Kangtao Lv; Yujin Yuan; Wenbo Su; Bo Zheng

Abstract

The quality of pre-training data critically impacts the capabilities of large language models. Existing pipelines rely on expert-crafted heuristic rules, which primarily operate at the sample level and are based on coarse statistical indicators, thus lacking content-aware, fine-grained noise detection. While recent generative approaches, e.g., ProX-C, enable token-level refinement, their reliance on synthesizing Python code incurs prohibitive computational cost at scale and can introduce hallucinations into the refined data. To overcome these limitations, we propose Selecting over Tokens (SelecT), a novel framework that reframes data refinement as a highly efficient token classification task. SelecT classifies each token as either informative or noisy and subsequently removes the latter. This design achieves fine-grained data optimization while avoiding the inefficiency of generation, ensuring scalability. When evaluated on diverse downstream benchmarks, the model trained on SelecT-refined corpora, on average, outperforms the one trained on raw data by over 2% and exceeds the best heuristic baselines by more than 1% while preserving 17% more tokens than the latter. Furthermore, SelecT achieves higher average performance than the generative ProX-C across all experimental settings, and is 2.5x faster at inference, even with twice the parameters. Our results establish SelecT as an effective, efficient, and scalable solution for pre-training data optimization.
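The classify-then-remove step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `refine_tokens`, `toy_noise_score`, and the 0.5 threshold are hypothetical names and values chosen for illustration. A keyword lookup stands in for the trained token classifier, whose architecture and labels are not specified in this abstract.

```python
from typing import Callable, List, Sequence


def refine_tokens(
    tokens: Sequence[str],
    noise_score: Callable[[Sequence[str]], List[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Keep only tokens whose noise score falls below the threshold."""
    scores = noise_score(tokens)
    if len(scores) != len(tokens):
        raise ValueError("noise_score must return one score per token")
    return [tok for tok, score in zip(tokens, scores) if score < threshold]


def toy_noise_score(tokens: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a trained per-token classifier.

    Flags a few web-boilerplate strings as noisy (score 1.0) and
    treats everything else as informative (score 0.0).
    """
    boilerplate = {"Login", "|", "Share", "Subscribe"}
    return [1.0 if tok in boilerplate else 0.0 for tok in tokens]


tokens = "Login | Share Data quality matters for pre-training".split()
kept = refine_tokens(tokens, toy_noise_score)
print(" ".join(kept))
```

Because each token receives a single label in one forward pass, the cost grows with input length rather than with the length of a generated program, which is the efficiency argument the abstract makes against generative refinement.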

Anthology ID: 2026.acl-long.2219
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 48060–48085
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2219/
Cite (ACL): Xin Tong, Weidong Zhang, Jiaang Li, Haibin Chen, Shilei Liu, Langming Liu, Kangtao Lv, Yujin Yuan, Wenbo Su, and Bo Zheng. 2026. SELECting over Tokens: Curating Pre-training Data at Scale via Token Classification. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 48060–48085, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SELECting over Tokens: Curating Pre-training Data at Scale via Token Classification (Tong et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2219.pdf
Checklist: 2026.acl-long.2219.checklist.pdf
