Soft-Labeled Contrastive Pre-Training for Function-Level Code Representation

Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen, Nan Duan


Abstract
Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods to learn functional-level Code Representation. Considering the relevance between codes in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft-labels through an iterative adversarial manner and use them to learn better code representation. The positive sample construction is another key for contrastive pre-training. Previous works use transformation-based methods like variable renaming to generate semantically equal positive codes. However, they usually result in the generated code with a highly similar surface form, and thus mislead the model to focus on superficial code structure instead of code semantics. To encourage SCodeR to capture semantic information from the code, we utilize code comments and abstract syntax sub-trees of the code to build positive samples. We conduct experiments on four code-related tasks over seven datasets. Extensive experimental results show that SCodeR achieves new state-of-the-art performance on all of them, which illustrates the effectiveness of the proposed pre-training method.
Anthology ID:
2022.findings-emnlp.9
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
118–129
Language:
URL:
https://aclanthology.org/2022.findings-emnlp.9
DOI:
10.18653/v1/2022.findings-emnlp.9
Bibkey:
Cite (ACL):
Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. Soft-Labeled Contrastive Pre-Training for Function-Level Code Representation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 118–129, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Soft-Labeled Contrastive Pre-Training for Function-Level Code Representation (Li et al., Findings 2022)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2022.findings-emnlp.9.pdf