UniXcoder: Unified Cross-Modal Pre-training for Code Representation
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
Abstract
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming languages. The model utilizes mask attention matrices with prefix adapters to control its behavior and leverages cross-modal contents such as ASTs and code comments to enhance code representation. To encode an AST, which is represented as a tree, in parallel, we propose a one-to-one mapping method that transforms the AST into a sequence structure while retaining all structural information of the tree. Furthermore, we propose to utilize multi-modal contents to learn the representation of a code fragment with contrastive learning, and then align representations across programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the quality of code fragment representations, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks, and analysis reveals that comments and ASTs can both enhance UniXcoder.
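The abstract's central mechanism is a single Transformer whose mode (encoder-only, decoder-only, or encoder-decoder) is selected by prefix tokens together with a mask attention matrix. The snippet below is a minimal sketch of what such masks look like; it is not the paper's implementation, and the function name, the NumPy representation, and the fixed `src_len` split point are assumptions made here for illustration.

```python
import numpy as np

def attention_mask(mode: str, seq_len: int, src_len: int = 0) -> np.ndarray:
    """Entry [i, j] is 1 if query position i may attend to key position j."""
    if mode == "encoder-only":
        # Bidirectional attention: every token can see every token.
        return np.ones((seq_len, seq_len), dtype=int)
    if mode == "decoder-only":
        # Causal attention: each token sees only itself and earlier tokens.
        return np.tril(np.ones((seq_len, seq_len), dtype=int))
    if mode == "encoder-decoder":
        # The source prefix is visible to all positions and is itself
        # bidirectional; the target suffix is causal and hidden from the source.
        mask = np.zeros((seq_len, seq_len), dtype=int)
        mask[:, :src_len] = 1
        tgt = seq_len - src_len
        mask[src_len:, src_len:] = np.tril(np.ones((tgt, tgt), dtype=int))
        return mask
    raise ValueError(f"unknown mode: {mode}")

# Example: a 6-token sequence whose first 3 tokens form the source prefix.
print(attention_mask("encoder-decoder", seq_len=6, src_len=3))
```

In the paper the mode is signaled by a special prefix in the input rather than a Python argument, but the resulting visibility pattern over positions is the point this sketch tries to convey.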
- Anthology ID: 2022.acl-long.499
- Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: May
- Year: 2022
- Address: Dublin, Ireland
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 7212–7225
- URL: https://aclanthology.org/2022.acl-long.499
- DOI: 10.18653/v1/2022.acl-long.499
- Cite (ACL): Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7212–7225, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal): UniXcoder: Unified Cross-Modal Pre-training for Code Representation (Guo et al., ACL 2022)
- PDF: https://aclanthology.org/2022.acl-long.499.pdf
- Code: microsoft/CodeBERT + additional community code
- Data: CoSQA, CodeSearchNet, CodeXGLUE