Jiaang Li
2026
SELECting over Tokens: Curating Pre-training Data at Scale via Token Classification
Xin Tong | Weidong Zhang | Jiaang Li | Haibin Chen | Shilei Liu | Langming Liu | Kangtao Lv | Yujin Yuan | Wenbo Su | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quality of pre-training data critically impacts the capabilities of large language models. Existing pipelines rely on expert-crafted heuristic rules, which primarily operate at the sample level and are based on coarse statistical indicators, thus lacking content-aware, fine-grained noise detection. While recent generative approaches, e.g., ProX-C, enable token-level refinement, their reliance on synthesizing Python code incurs prohibitive computational cost at scale and can introduce hallucinations into the refined data. To overcome these limitations, we propose Selecting over Tokens (SelecT), a novel framework that reframes data refinement as a highly efficient token classification task. SelecT classifies each token as either informative or noisy and subsequently removes the latter. This design achieves fine-grained data optimization while avoiding the inefficiency of generation, ensuring scalability. When evaluated on diverse downstream benchmarks, the model trained on SelecT-refined corpora, on average, outperforms the one trained on raw data by over 2% and exceeds the best heuristic baselines by more than 1% while preserving 17% more tokens than the latter. Furthermore, SelecT achieves higher average performance than the generative ProX-C across all experimental settings, and is 2.5x faster at inference, even with twice the parameters. Our results establish SelecT as an effective, efficient, and scalable solution for pre-training data optimization.
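The classify-then-remove step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `refine` and `toy_classifier`, the keep/drop boolean mask, and the word-list rule standing in for a trained token classifier are all assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def refine(
    tokens: Sequence[str],
    classify: Callable[[Sequence[str]], List[bool]],
) -> List[str]:
    """Keep only the tokens the classifier marks as informative.

    `classify` returns one boolean per token: True = informative (keep),
    False = noisy (drop). No new text is generated; tokens are only removed.
    """
    keep = classify(tokens)
    if len(keep) != len(tokens):
        raise ValueError("classifier must return one label per token")
    return [tok for tok, k in zip(tokens, keep) if k]


def toy_classifier(tokens: Sequence[str]) -> List[bool]:
    """Hypothetical stand-in for a learned token classifier.

    A real system would use a trained model; this rule simply flags a few
    web-page widget strings as noise.
    """
    noisy = {"[click]", "[share]", "[login]"}
    return [tok.lower() not in noisy for tok in tokens]


sample = "Data quality matters [share] [login] for pre-training".split()
refined = refine(sample, toy_classifier)
print(" ".join(refined))
```

Because the output is always a subsequence of the input, a filter of this shape cannot insert content that was absent from the original text, which is the property the abstract contrasts with generative refinement.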