How Do Your Code LLMs perform? Empowering Code Instruction Tuning with Really Good Data

Yejie Wang; Keqing He; Dayuan Fu; Zhuoma GongQue; Heyang Xu; Yanxu Chen; Zhexu Wang; Yujia Fu; Guanting Dong; Muxi Diao; Jingang Wang; Mengdi Zhang; Xunliang Cai; Weiran Xu

doi:10.18653/v1/2024.emnlp-main.777

Abstract

Recently, there has been growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained on these datasets perform well on HumanEval but worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs.

Anthology ID: 2024.emnlp-main.777
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 14027–14043
URL: https://aclanthology.org/2024.emnlp-main.777
DOI: 10.18653/v1/2024.emnlp-main.777
Cite (ACL): Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, and Weiran Xu. 2024. How Do Your Code LLMs perform? Empowering Code Instruction Tuning with Really Good Data. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): How Do Your Code LLMs perform? Empowering Code Instruction Tuning with Really Good Data (Wang et al., EMNLP 2024)
PDF: https://preview.aclanthology.org/landing_page/2024.emnlp-main.777.pdf