Nghi Bui

2023

Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets.However, they are unstable when applied to deep networks.In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem.We show that IFs are unreliable when the two data points belong to two different classes.Our solution leverages class information to improve the stability of IFs.Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.

pdf abs
Better Language Models of Code through Self-Improvement
Hung To | Nghi Bui | Jin L.C. Guo | Tien Nguyen
Findings of the Association for Computational Linguistics: ACL 2023

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to improve this issue by proposing a data augmentation framework using knowledge distillation. Our framework utilizes knowledge gained during the pre-training and fine-tuning stage to augment training data, which is then used for the next step. We incorporate this framework into the state-of-the-art language models, such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework significantly improves PLMCs’ performance in sequence-generation tasks, such as code summarization and code generation in the CodeXGLUE benchmark.

2022

pdf abs
Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5
Nghi Bui | Yue Wang | Steven C.H. Hoi
Findings of the Association for Computational Linguistics: EMNLP 2022

Automated software debugging is a crucial task for improving the productivity of software developers. Many neural-based techniques have been proven effective for debugging-related tasks such as bug localization and program repair (or bug fixing). However, these techniques often focus only on either one of them or approach them in a stage-wise manner, ignoring the mutual benefits between them. In this work, we propose a novel unified Detect-Localize-Repair framework based on a pretrained programming language model CodeT5 to seamlessly address these tasks, named CodeT5-DLR. Specifically, we propose three objectives to adapt the generic CodeT5 for debugging: a bug detection objective to determine whether a given code snippet is buggy or not, a bug localization objective to identify the buggy lines, and a program repair objective to translate the buggy code to its fixed version. We evaluate it on each of these tasks and their combined setting on two newly collected line-level debugging datasets in Java and Python. Extensive results show that our model significantly outperforms existing baselines from both NLP and software engineering domains.

Co-authors

Hung To 1

Venues

findings2
acl1