Abstract
Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the provided dataset. We aim to address this issue by proposing a data augmentation framework based on knowledge distillation. Our framework uses knowledge gained during the pre-training and fine-tuning stages to augment the training data, which is then used in the next fine-tuning step. We incorporate this framework into state-of-the-art language models such as CodeT5, CodeBERT, and UniXcoder. The results show that our framework significantly improves PLMCs’ performance on sequence-generation tasks such as code summarization and code generation on the CodeXGLUE benchmark.
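The abstract describes a self-improvement loop: fine-tune the model, use the fine-tuned model itself to generate pseudo-targets that augment the training data, then fine-tune again on the enlarged set. The following is a minimal Python sketch of that loop using a Hugging Face seq2seq checkpoint; the toy `train_pairs` data, the `generate_pseudo_target` helper, and the simple union of gold and pseudo-labeled pairs are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch of the self-improvement loop described in the abstract.
# Helper names and the augmentation heuristic are hypothetical.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "Salesforce/codet5-base"  # any seq2seq PLMC could be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def generate_pseudo_target(source_code: str, num_beams: int = 5) -> str:
    """Use the (fine-tuned) model itself to produce a candidate target
    sequence, e.g. a code summary, for a training input."""
    inputs = tokenizer(source_code, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=num_beams, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Step 1: fine-tune on the original (code, target) pairs -- standard
# supervised fine-tuning, omitted here; `train_pairs` is a toy example.
train_pairs = [("def add(a, b): return a + b", "Add two numbers.")]

# Step 2: augment the training data with the model's own predictions
# (knowledge distilled from the pre-trained and fine-tuned model).
augmented = [(code, generate_pseudo_target(code)) for code, _ in train_pairs]

# Step 3: fine-tune again on the union of gold and pseudo-labeled pairs.
next_round_data = train_pairs + augmented
```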
- Anthology ID: 2023.findings-acl.823
- Volume: Findings of the Association for Computational Linguistics: ACL 2023
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 12994–13002
- URL: https://aclanthology.org/2023.findings-acl.823
- DOI: 10.18653/v1/2023.findings-acl.823
- Cite (ACL): Hung To, Nghi Bui, Jin L.C. Guo, and Tien Nguyen. 2023. Better Language Models of Code through Self-Improvement. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12994–13002, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Better Language Models of Code through Self-Improvement (To et al., Findings 2023)
- PDF: https://preview.aclanthology.org/improve-issue-templates/2023.findings-acl.823.pdf