Program Translation via Code Distillation

Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, Neel Sundaresan


Abstract
Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for program translation due to a dearth of aligned data. Recent unsupervised neural machine translation techniques have overcome data limitations by including techniques such as back translation and low-level compiler intermediate representations (IR). These methods face significant challenges due to the noise in code snippet alignment and the diversity of IRs, respectively. In this paper we propose a novel model called Code Distillation (CoDist), whereby we capture the semantic and structural equivalence of code in a language-agnostic intermediate representation. Distilled code serves as a translation pivot for any programming language, leading by construction to parallel corpora which scale to all available source code by simply applying the distillation compiler. We demonstrate that our approach achieves state-of-the-art performance on the CodeXGLUE and TransCoder GeeksforGeeks translation benchmarks, with an average absolute increase of 12.7% on the TransCoder GeeksforGeeks translation benchmark compared to TransCoder-ST.
Anthology ID:
2023.emnlp-main.672
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10903–10914
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.672/
DOI:
10.18653/v1/2023.emnlp-main.672
Cite (ACL):
Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, and Neel Sundaresan. 2023. Program Translation via Code Distillation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10903–10914, Singapore. Association for Computational Linguistics.
Cite (Informal):
Program Translation via Code Distillation (Huang et al., EMNLP 2023)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.672.pdf
Video:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.672.mp4