CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

Qiushi Sun, Jingyang Gong, Lei Li, Qipeng Guo, Fei Yuan


Abstract
Acquiring high-quality instruction-code pairs is essential for training Large Language Models for code generation. While automated synthesis has emerged as an alternative to expensive manual curation, current approaches often rely on rigid heuristics, yielding data that is ungrounded or lacks logical complexity. We propose CodeEvo, a dual-agent architecture comprising a Coder for iterative solution synthesis and a Reviewer to orchestrate the generation trajectory. To transcend the limitations of existing heuristics, the Reviewer formulates a Schema to systematically architect logic and complexity through an interleaved synthesis of instructions and code. This process is further reinforced by a hybrid verification protocol synergizing deterministic compiler feedback with semantic evaluation. Under this framework, we construct CodeEvo-100K, a large-scale dataset of instruction–code pairs with stepped difficulty levels. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks. In-depth analyses further provide insights into effective code-centric data synthesis. Code and data are available at https://github.com/QiushiSun/CodeEvo.
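The interaction described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Pair`, `compiles`, and `synthesize`, the callable signatures for the Coder and Reviewer, and the retry limits are all hypothetical. Python's built-in `compile()` stands in for the deterministic compiler feedback, a caller-supplied `judge` stands in for the semantic evaluation, and the Reviewer's Schema is not modelled at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pair:
    """One accepted instruction-code pair with its difficulty step (hypothetical type)."""
    instruction: str
    code: str
    level: int


def compiles(source: str) -> Tuple[bool, str]:
    """Deterministic check: report whether the source parses as Python."""
    try:
        compile(source, "<candidate>", "exec")
        return True, ""
    except SyntaxError as err:
        return False, f"SyntaxError: {err.msg} (line {err.lineno})"


def synthesize(
    seed: str,
    coder: Callable[[str, str], str],    # (instruction, feedback) -> code
    evolve: Callable[[str, str], str],   # (instruction, code) -> harder instruction
    judge: Callable[[str, str], bool],   # (instruction, code) -> semantically acceptable?
    rounds: int = 3,
    max_fixes: int = 2,
) -> List[Pair]:
    """Alternate Coder attempts and Reviewer steps, keeping only verified pairs."""
    kept: List[Pair] = []
    instruction = seed
    for level in range(rounds):
        feedback = ""
        accepted = None
        for _ in range(max_fixes + 1):
            code = coder(instruction, feedback)
            ok, feedback = compiles(code)
            if ok and judge(instruction, code):
                accepted = code
                break
            if ok:
                # Compiled but failed the semantic check; pass that back to the Coder.
                feedback = "rejected by semantic check"
        if accepted is None:
            break  # Assumed policy: stop evolving once a step cannot be solved.
        kept.append(Pair(instruction, accepted, level))
        instruction = evolve(instruction, accepted)
    return kept
```

In practice `coder`, `evolve`, and `judge` would wrap model calls; for a dry run they can be plain functions, for example a `coder` that returns valid code only once it receives non-empty feedback.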
Anthology ID:
2026.acl-long.438
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9668–9687
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.438/
Cite (ACL):
Qiushi Sun, Jingyang Gong, Lei Li, Qipeng Guo, and Fei Yuan. 2026. CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9668–9687, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback (Sun et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.438.pdf
Checklist:
 2026.acl-long.438.checklist.pdf