Joonghyuk Hahn


2022

Boosting Code Summarization by Embedding Code Structures
Jikyoeng Son | Joonghyuk Hahn | HyeonTae Seo | Yo-Sub Han
Proceedings of the 29th International Conference on Computational Linguistics

Recent research on code summarization relies on structural information from the abstract syntax tree (AST) of source code. It is questionable, however, whether the AST is the most effective way to express this structural information. We find that a program dependency graph (PDG) can represent the structure of code more effectively. We propose the PDG Boosting Module (PBM), which encodes a PDG into a graph embedding, together with a framework for integrating PBM into existing models. PBM achieves average improvements of 6.67% (BLEU) and 7.47% (ROUGE). We then analyze the experimental results and examine how PBM helps the training of baseline models and how robust its performance is. To validate robustness, we measure performance on an out-of-domain benchmark dataset and confirm that the gains hold. In addition, we apply a new evaluation measure, the SBERT score, to evaluate semantic quality. Models implemented with PBM also improve the SBERT score, implying that they generate summaries semantically closer to the reference summaries.
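The abstract gives no implementation details, but the core idea of encoding a PDG into a graph embedding and feeding that structural signal to an existing summarization model can be sketched roughly as below. This is a minimal sketch, assuming PyTorch and a hand-rolled mean-aggregating graph convolution; the class SimpleGraphEncoder, the tensor dimensions, and the fuse-by-addition step are hypothetical illustrations, not the paper's actual architecture.

```python
# Illustrative sketch (not the authors' implementation): encode a program
# dependency graph (PDG) into a single graph embedding, then broadcast it
# onto the token encodings produced by any existing code encoder.
import torch
import torch.nn as nn


class SimpleGraphEncoder(nn.Module):
    """Hypothetical mean-aggregating graph convolution over PDG nodes."""

    def __init__(self, node_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(node_dim, hidden_dim)
        self.update = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, node_dim), adj: (num_nodes, num_nodes)
        h = torch.relu(self.proj(node_feats))
        # Average each node's neighbor features along the dependency edges.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        neighbor_mean = (adj @ h) / deg
        h = torch.relu(self.update(neighbor_mean))
        # Pool node states into one embedding for the whole PDG.
        return h.mean(dim=0)


# Usage: fuse the PDG embedding with token-level code encodings.
num_nodes, node_dim, hidden_dim = 5, 32, 64
node_feats = torch.randn(num_nodes, node_dim)               # one vector per PDG node
adj = torch.randint(0, 2, (num_nodes, num_nodes)).float()   # dependency edges
graph_emb = SimpleGraphEncoder(node_dim, hidden_dim)(node_feats, adj)
token_encodings = torch.randn(20, hidden_dim)               # from any existing encoder
fused = token_encodings + graph_emb                         # broadcast structural signal
print(fused.shape)  # torch.Size([20, 64])
```

The addition-based fusion is only one simple choice; any mechanism that lets the downstream decoder attend to the graph embedding would fit the same framing.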

2021

Self-Training using Rules of Grammar for Few-Shot NLU
Joonghyuk Hahn | Hyunjoon Cheon | Kyuyeol Han | Cheongjae Lee | Junseok Kim | Yo-Sub Han
Findings of the Association for Computational Linguistics: EMNLP 2021

We tackle the problem of self-training networks for NLU in a low-resource environment: few labeled data and a large amount of unlabeled data. The effectiveness of self-training stems from increasing the amount of training data during training, yet it becomes less effective in low-resource settings because of unreliable labels predicted by the teacher model on unlabeled data. Rules of grammar, which describe the grammatical structure of data, have been used in NLU for better explainability. We propose using rules of grammar in self-training as a more reliable pseudo-labeling mechanism, especially when labeled data are scarce. We design an effective algorithm that constructs and expands rules of grammar without human involvement, and then integrate the constructed rules into self-training as a pseudo-labeling mechanism. There are two possible scenarios regarding the data distribution: it is either unknown or known prior to training. We empirically demonstrate that our approach substantially outperforms state-of-the-art methods on three benchmark datasets in both scenarios.
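As a rough illustration of rule-based pseudo-labeling inside a self-training loop, the following is a minimal sketch, not the paper's algorithm: the RULES dictionary, its regex patterns, and the functions rule_pseudo_label and self_train_step are hypothetical, and a real system would construct and expand the rules automatically rather than hard-coding them.

```python
# Illustrative sketch: grammar-like rules provide pseudo-labels for unlabeled
# utterances, falling back to the teacher model only when no rule matches.
import re
from typing import Callable, Optional

# Hypothetical rules of grammar: intent label -> patterns that imply it.
RULES = {
    "book_flight": [r"\bbook\b.*\bflight\b", r"\bfly (to|from)\b"],
    "check_weather": [r"\bweather\b", r"\b(rain|sunny|forecast)\b"],
}


def rule_pseudo_label(utterance: str) -> Optional[str]:
    """Return an intent if any rule matches; None means the rules abstain."""
    for intent, patterns in RULES.items():
        if any(re.search(p, utterance.lower()) for p in patterns):
            return intent
    return None


def self_train_step(unlabeled: list[str],
                    teacher_predict: Callable[[str], str]) -> list[tuple[str, str]]:
    """Build a pseudo-labeled set, preferring rule labels over teacher labels."""
    pseudo_labeled = []
    for utt in unlabeled:
        label = rule_pseudo_label(utt) or teacher_predict(utt)
        pseudo_labeled.append((utt, label))
    return pseudo_labeled


# Usage with a stand-in teacher model.
unlabeled = ["please book a flight to Seoul", "will it rain tomorrow"]
teacher = lambda utt: "unknown"  # placeholder for a trained classifier
print(self_train_step(unlabeled, teacher))
# [('please book a flight to Seoul', 'book_flight'),
#  ('will it rain tomorrow', 'check_weather')]
```

The pseudo-labeled pairs would then be added to the training set for the next self-training iteration; the key point is that rule-derived labels are treated as more reliable than teacher predictions when labeled data are scarce.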