Jinghan Zhang
2026
How Can Synthetic Data Improve Multilingual Language Model Pretraining? A Data Quality Perspective
Tongyao Zhu | Qian Liu | Chang Ma | Jinghan Zhang | Longxu Dou | Junxian He | Shiqi Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Low-resource languages challenge multilingual LLMs because high-quality training data is limited, which leads to weaker performance on complex reasoning and knowledge tasks. To address this, we propose improving training data quality through data synthesis rather than simply scaling up resources. First, we introduce SynTrans, which translates high-quality, knowledge-rich English data into low-resource languages during pre-training to inject world knowledge, though at the cost of semantic fluency. To address low-quality data while maintaining fluency, we also propose SynRank. SynRank uses synthetic data as positive samples to train a classifier that ranks and filters noisy real-world data, enabling the extraction of high-quality subsets without expensive human cleaning. Experiments show that SynRank matches handcrafted rule-based filtering by human experts and significantly improves performance on knowledge-intensive tasks at the same filtering rate. Remarkably, higher filtering rates improve performance even with less data, surpassing expert filtering and demonstrating the efficiency and effectiveness of our method. Lastly, we introduce DA-QwenScore, a training-free metric that evaluates corpus quality by normalizing model loss with diversity measures, further improving evaluation efficiency. Our insights into knowledge injection could advance the development of multilingual LLMs for low-resource languages.
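The rank-and-filter idea behind SynRank can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the classifier, the features, or the source of negative samples. The sketch assumes a naive Bayes style log-odds scorer over whitespace tokens, and it assumes that negatives are drawn from a sample of the noisy raw corpus. All function names (`train_log_odds`, `quality_score`, `rank_and_filter`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def train_log_odds(positives, negatives, smoothing=1.0):
    """Fit per-token log-odds weights, a naive Bayes style classifier."""
    pos_counts = Counter(tok for doc in positives for tok in tokenize(doc))
    neg_counts = Counter(tok for doc in negatives for tok in tokenize(doc))
    vocab = set(pos_counts) | set(neg_counts)
    # Add-one style smoothing so unseen tokens never give log(0).
    pos_total = sum(pos_counts.values()) + smoothing * len(vocab)
    neg_total = sum(neg_counts.values()) + smoothing * len(vocab)
    return {
        tok: math.log((pos_counts[tok] + smoothing) / pos_total)
        - math.log((neg_counts[tok] + smoothing) / neg_total)
        for tok in vocab
    }


def quality_score(doc, weights):
    """Mean token log-odds; tokens outside the vocabulary contribute 0."""
    tokens = tokenize(doc)
    if not tokens:
        return float("-inf")
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


def rank_and_filter(docs, weights, filter_rate):
    """Rank docs by score and drop the lowest `filter_rate` fraction."""
    ranked = sorted(docs, key=lambda d: quality_score(d, weights), reverse=True)
    keep = max(1, round(len(ranked) * (1.0 - filter_rate)))
    return ranked[:keep]


# Toy data (hypothetical): synthetic text as positives, noisy web text as negatives.
synthetic = [
    "the river flows from the mountains to the sea",
    "photosynthesis converts light into chemical energy",
]
noisy_sample = [
    "click here buy now free free free",
    "login share subscribe click here",
]
raw_corpus = [
    "plants use light energy to grow",
    "buy now click here for free prizes",
    "the sea receives water from the river",
    "subscribe and share click login",
]

weights = train_log_odds(synthetic, noisy_sample)
kept = rank_and_filter(raw_corpus, weights, filter_rate=0.5)
```

With a filtering rate of 0.5, the two documents whose vocabulary resembles the synthetic positives are kept and the two spam-like documents are dropped. A realistic pipeline would likely replace the token scorer with a stronger text classifier and tune the filtering rate against downstream task performance.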