Duc Thien Nguyen


2025

Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen | Philip Arthur | Qianyu Feng | Cong Duy Vu Hoang | Yu-Heng Hong | Mahdi Kazemi Moghaddam | Omid Nezami | Duc Thien Nguyen | Gioacchino Tangari | Duy Vu | Thanh Vu | Mark Johnson | Krishnaram Kenthapadi | Don Dharmasiri | Long Duong | Yuan-Fang Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.