Maoquan Wang

2023

Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for program translation due to a dearth of aligned data. Recent unsupervised neural machine translation techniques have overcome data limitations by included techniques such as back translation and low level compiler intermediate representations (IR). These methods face significant challenges due to the noise in code snippet alignment and the diversity of IRs respectively. In this paper we propose a novel model called Code Distillation (CoDist) whereby we capture the semantic and structural equivalence of code in a language agnostic intermediate representation. Distilled code serves as a translation pivot for any programming language, leading by construction to parallel corpora which scale to all available source code by simply applying the distillation compiler. We demonstrate that our approach achieves state-of-the-art performance on CodeXGLUE and TransCoder GeeksForGeeks translation benchmarks, with an average absolute increase of 12.7% on the TransCoder GeeksforGeeks translation benchmark compare to TransCoder-ST.

Automatic Program translation has enormous application value and hence has been attracting significant interest from AI researchers. However, we observe that current program translation models still make elementary syntax errors, particularly, when the target language does not have syntax elements in the source language. Metrics like BLUE, CodeBLUE and computation accuracy may not expose these issues. In this paper we introduce a new metrics for programming language translation and these metrics address these basic syntax errors. We develop a novel active defects probing suite called Syntactic Unit Tests (SUT) which includes a highly interpretable evaluation harness for accuracy and test scoring. Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests. Specifically, compared to previous program translation task evaluation dataset, its pass rate on our unit tests has decreased by 26.15%. Further our evaluation harness reveal syntactic element errors in which these models exhibit deficiencies.

2018

pdf abs
EmojiIt at SemEval-2018 Task 2: An Effective Attention-Based Recurrent Neural Network Model for Emoji Prediction with Characters Gated Words
Shiyun Chen | Maoquan Wang | Liang He
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper presents our single model to Subtask 1 of SemEval 2018 Task 2: Emoji Prediction in English. In order to predict the emoji that may be contained in a tweet, the basic model we use is an attention-based recurrent neural network which has achieved satisfactory performs in Natural Language processing. Considering the text comes from social media, it contains many discrepant abbreviations and online terms, we also combine word-level and character-level word vector embedding to better handling the words not appear in the vocabulary. Our single model1 achieved 29.50% Macro F-score in test data and ranks 9th among 48 teams.

2017

pdf abs
EICA Team at SemEval-2017 Task 3: Semantic and Metadata-based Features for Community Question Answering
Yufei Xie | Maoquan Wang | Jing Ma | Jian Jiang | Zhao Lu
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe our system for participating in SemEval-2017 Task 3 on Community Question Answering. Our approach relies on combining a rich set of various types of features: semantic and metadata. The most important group turned out to be the metadata feature and the semantic vectors trained on QatarLiving data. In the main Subtask C, our primary submission was ranked fourth, with a MAP of 13.48 and accuracy of 97.08. In Subtask A, our primary submission get into the top 50%.

pdf abs
EICA at SemEval-2017 Task 4: A Simple Convolutional Neural Network for Topic-based Sentiment Classification
Maoquan Wang | Shiyun Chen | Yufei Xie | Lu Zhao
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our approach for SemEval-2017 Task 4 - Sentiment Analysis in Twitter (SAT). Its five subtasks are divided into two categories: (1) sentiment classification, i.e., predicting topic-based tweet sentiment polarity, and (2) sentiment quantification, that is, estimating the sentiment distributions of a set of given tweets. We build a convolutional sentence classification system for the task of SAT. Official results show that the experimental results of our system are comparative.

Co-authors

Jing Ma 1

Zhao Lu 1

Lu Zhao 1

Liang He 1

Venues

semeval3
emnlp2