Hongfei Yan
2026
Large-Scale Diverse Synthesis for Mid-Training
Xuemiao Zhang | Chengying Tu | Can Ren | Rongxiang Weng | Hongfei Yan | Jingang Wang | Xunliang Cai
Findings of the Association for Computational Linguistics: ACL 2026
Mid-training has become critical for enhancing the knowledge and reasoning ability of large language models (LLMs), especially through the utilization of large-scale synthetic data. However, existing data synthesis methods often generate simplistic and homogeneous QA pairs, with limited scale and diversity. To address this, we propose BoostQA, a novel framework designed to synthesize large-scale, diverse, and high-quality QA data for mid-training. BoostQA introduces model probes during mid-training for the first time and implements STEM-focused multi-grade synthesis to boost data diversity as well as high-difficulty synthesis to alleviate difficulty degradation, followed by answer refinement to further improve quality. Extensive experiments by mid-training Llama-3 8B demonstrate that using only 20B-token BoostQA data achieves a significant average improvement of 12.74% on MMLU and CMMLU over the pre-training baseline. After mid-training on 500B tokens, including 100B-token BoostQA data, our model achieves SOTA average results across benchmarks among mainstream models of comparable size. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.
LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points
Xuemiao Zhang | Can Ren | Chengying Tu | Rongxiang Weng | Hongfei Yan | Jingang Wang | Xunliang Cai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a KP-graph-based synthesis framework that for the first time enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via a strong reasoning model by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.
2015
User Based Aggregation for Biterm Topic Model
Weizheng Chen | Jinpeng Wang | Yan Zhang | Hongfei Yan | Xiaoming Li
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
2014
Group based Self Training for E-Commerce Product Record Linkage
Xin Zhao | Yuexin Wu | Hongfei Yan | Xiaoming Li
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
2013
Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs
Jinpeng Wang | Wayne Xin Zhao | Haitian Wei | Hongfei Yan | Xiaoming Li
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
2012
Identifying Event-related Bursts via Social Media Activities
Xin Zhao | Baihan Shu | Jing Jiang | Yang Song | Hongfei Yan | Xiaoming Li
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
A Novel Burst-based Text Representation Model for Scalable Event Detection
Xin Zhao | Rishan Chen | Kai Fan | Hongfei Yan | Xiaoming Li
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
SSHLDA: A Semi-Supervised Hierarchical Topic Model
Xian-Ling Mao | Zhao-Yan Ming | Tat-Seng Chua | Si Li | Hongfei Yan | Xiaoming Li
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning