FIRE: Flexible Integration of Data Quality Ratings for Effective Pretraining

Xu Liangyu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai


Abstract
Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5% of the tokens needed by the Random baseline to reach the target performance.
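The abstract does not say how ratings are aligned, combined, or refined, so the following is only an illustrative sketch of the general idea and not the FIRE algorithm itself. It assumes rank-based normalization as the "unified space", a weighted average as the integration step, and repeated top-fraction filtering as the progressive selection. The function names, the weights, and the `keep_fraction` and `rounds` parameters are all hypothetical.

```python
from typing import Dict, List


def rank_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to [0, 1] by rank so raters on different scales become comparable.

    Assumption: rank normalization stands in for whatever alignment FIRE uses.
    Tied scores receive distinct adjacent ranks in input order.
    """
    n = len(scores)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: scores[i])
    normalized = [0.0] * n
    for rank, idx in enumerate(order):
        normalized[idx] = rank / (n - 1)
    return normalized


def fuse_ratings(ratings: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of rank-normalized scores (hypothetical integration rule)."""
    total = sum(weights[name] for name in ratings)
    aligned = {name: rank_normalize(vals) for name, vals in ratings.items()}
    n = len(next(iter(aligned.values())))
    return [
        sum(weights[name] * aligned[name][i] for name in aligned) / total
        for i in range(n)
    ]


def progressive_select(fused: List[float], keep_fraction: float, rounds: int) -> List[int]:
    """Keep the top `keep_fraction` of the surviving pool, once per round.

    Returns the surviving document indices in ascending order.
    """
    pool = list(range(len(fused)))
    for _ in range(rounds):
        k = max(1, int(len(pool) * keep_fraction))
        pool = sorted(pool, key=lambda i: fused[i], reverse=True)[:k]
    return sorted(pool)


# Two hypothetical raters scoring four documents on unrelated scales.
ratings = {"rater_a": [0.1, 0.9, 0.5, 0.3], "rater_b": [10.0, 80.0, 90.0, 20.0]}
weights = {"rater_a": 2.0, "rater_b": 1.0}
fused = fuse_ratings(ratings, weights)
selected = progressive_select(fused, keep_fraction=0.5, rounds=2)
```

Rank normalization is used here only because it needs no knowledge of each rater's score range; any monotone mapping to a shared scale would fit the same slot in the sketch.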
Anthology ID:
2025.emnlp-main.735
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14532–14552
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.735/
Cite (ACL):
Xu Liangyu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, and Xunliang Cai. 2025. FIRE: Flexible Integration of Data Quality Ratings for Effective Pretraining. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14532–14552, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
FIRE: Flexible Integration of Data Quality Ratings for Effective Pretraining (Liangyu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.735.pdf
Checklist:
2025.emnlp-main.735.checklist.pdf