UnitCoder: Scalable Code Synthesis from Pre-training Corpora

Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen


Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet code generation remains a major challenge. Despite the abundance of code data, constructing high-quality training datasets at scale remains difficult. Pre-training code data typically suffers from inconsistent quality, while instruction-based methods, which use a high-quality subset as seed samples, suffer from limited task diversity. In this paper, we introduce UnitCoder, which directly supervises pre-training data quality through automatically generated unit tests and ensures correctness via an iterative fix-and-refine flow. Code synthesized by UnitCoder benefits from both the diversity of pre-training corpora and the high quality ensured by unit-test supervision. Our experiments demonstrate that models fine-tuned on our synthetic dataset exhibit consistent performance improvements. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released.
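As a concrete illustration of the pipeline the abstract describes, the sketch below shows a minimal unit-test-supervised synthesis loop: generate unit tests for a snippet mined from the pre-training corpus, execute them, and iteratively ask the model to fix failing code. The helpers `generate_tests` and `fix_code`, the iteration budget, and the subprocess-based test runner are all illustrative assumptions, not details taken from the paper.

```python
import subprocess
import tempfile

MAX_FIX_ROUNDS = 3  # iteration budget; an assumption, not stated in the abstract

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Run candidate code together with its unit tests in a subprocess;
    return (passed, error_output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return result.returncode == 0, result.stderr

def synthesize(snippet: str, generate_tests, fix_code) -> str | None:
    """Unit-test-supervised synthesis for one pre-training snippet.

    `generate_tests(code)` and `fix_code(code, error)` are hypothetical
    stand-ins for LLM calls; they are not the paper's actual API.
    """
    tests = generate_tests(snippet)       # model-generated unit tests
    code = snippet
    for _ in range(MAX_FIX_ROUNDS):
        passed, error = run_tests(code, tests)
        if passed:
            return code                   # keep as post-training data
        code = fix_code(code, error)      # iterative fix-and-refine step
    return None                           # drop snippets that never pass
```

Snippets that pass their model-generated tests are kept as training data; the rest are refined up to a fixed budget and otherwise discarded, so test execution acts as the quality filter over the diverse pre-training corpus.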
Anthology ID:
2025.emnlp-main.286
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5623–5641
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.286/
Cite (ACL):
Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, and Kai Chen. 2025. UnitCoder: Scalable Code Synthesis from Pre-training Corpora. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5623–5641, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
UnitCoder: Scalable Code Synthesis from Pre-training Corpora (Ma et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.286.pdf
Checklist:
 2025.emnlp-main.286.checklist.pdf