CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou
Abstract
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both “bimodal” data of NL-PL pairs and “unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NLPL probing.- Anthology ID:
- 2020.findings-emnlp.139
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 1536–1547
- Language:
- URL:
- https://aclanthology.org/2020.findings-emnlp.139
- DOI:
- 10.18653/v1/2020.findings-emnlp.139
- Cite (ACL):
- Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.
- Cite (Informal):
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages (Feng et al., Findings 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.139.pdf
- Code
- microsoft/CodeBERT + additional community code
- Data
- CodeSearchNet, ManyTypes4TypeScript