Duc Thien Nguyen

2025

Large language models (LLMs) have shown impressive performance in code understanding and generation, and coding tasks have become a key focus for researchers because of their practical applications and their value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.

Co-authors

Cong Duy Vu Hoang 1

Yu-Heng Hong 1

Mark Johnson 1

Krishnaram Kenthapadi 1

Yuan-Fang Li 1

Mahdi Kazemi Moghaddam 1

Omid Nezami 1

Gioacchino Tangari 1

Duy Vu 1

Thanh Vu 1

Venues

NAACL 1
