@inproceedings{arakelyan-etal-2023-exploring,
    title = "Exploring Distributional Shifts in Large Language Models for Code Analysis",
    author = "Arakelyan, Shushan and
      Das, Rocktim and
      Mao, Yi and
      Ren, Xiang",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.1013/",
    doi = "10.18653/v1/2023.emnlp-main.1013",
    pages = "16298--16314",
    abstract = "We systematically study how three large language models with code capabilities - CodeT5, Codex, and ChatGPT - generalize to out-of-domain data. We consider two fundamental applications - code summarization, and code generation. We split data into domains following its natural boundaries - by an organization, by a project, and by a module within the software project. We establish that samples from each new domain present all the models with a significant challenge of distribution shift. We study how established methods adapt models to better generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. Moreover, this solution can outperform direct finetuning for very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method to adapt to multiple domains at once. We find that for code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to a single domain."
}
@comment{
Markdown (Informal):
[Exploring Distributional Shifts in Large Language Models for Code Analysis](https://aclanthology.org/2023.emnlp-main.1013/) (Arakelyan et al., EMNLP 2023)
ACL
}