Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), 2025
Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks, where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for instruction generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. The proposed approach is highly parallelizable and remains effective even with small seed data and weaker generator models. We generated more than 7.5 million coding instructions with this approach, then evaluated them by fine-tuning LLMs on the synthetic samples, demonstrating a significant improvement in code generation capability over other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
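The evolutionary loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function and role names (`evolve_population`, `instructor`, `coder`, `judge`) are placeholders, and the three LLM roles are modeled as plain callables that a real pipeline would back with actual model endpoints run in parallel.

```python
import random

def evolve_population(seeds, instructor, coder, judge, target_size, rng=None):
    """Grow a pool of (instruction, code) pairs from seed instructions.

    Each step: sample parent instructions from the pool, let the
    Instructor-LLM produce a new instruction from them (mutation or
    crossover), let the Coder-LLM write a solution, and keep the pair
    only if the Judge-LLM accepts it. Accepted instructions rejoin the
    pool so later generations can build on them.
    """
    rng = rng or random.Random(0)
    population = list(seeds)
    accepted = []
    while len(accepted) < target_size:
        parents = rng.sample(population, k=min(2, len(population)))
        new_instruction = instructor(parents)   # instruction generation
        solution = coder(new_instruction)       # code synthesis
        if judge(new_instruction, solution):    # automatic quality filter
            accepted.append((new_instruction, solution))
            population.append(new_instruction)
    return accepted

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model behind it.
    seeds = ["reverse a string", "sum a list of numbers"]
    instructor = lambda parents: " and ".join(parents) + " (harder variant)"
    coder = lambda inst: f"# solution for: {inst}"
    judge = lambda inst, code: bool(inst) and code.startswith("#")
    pairs = evolve_population(seeds, instructor, coder, judge, target_size=5)
    print(len(pairs))  # → 5
```

Because each candidate is generated and judged independently, many such loops can run concurrently over a shared pool, which is what makes the scheme parallelizable.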