GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding

Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp


Abstract
Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code search tasks using natural language queries. These capabilities are heavily influenced by the quality and diversity of the available training data. Source code datasets used for training usually focus on the most popular languages, and testing is mostly conducted on the same distributions, often overlooking low-resource programming languages. Motivated by the NLP generalization taxonomy proposed by Hupkes et al., we propose a new benchmark dataset called GenCodeSearchNet (GeCS) which builds upon existing natural language code search datasets to systematically evaluate the programming language understanding generalization capabilities of language models. As part of the full dataset, we introduce a new, manually curated subset, StatCodeSearch, that focuses on R, a popular but so far underrepresented programming language that is often used by researchers outside the field of computer science. For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models in a zero-shot setting.
Anthology ID:
2023.genbench-1.2
Volume:
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
Month:
December
Year:
2023
Address:
Singapore
Editors:
Dieuwke Hupkes, Verna Dankers, Khuyagbaatar Batsuren, Koustuv Sinha, Amirhossein Kazemnejad, Christos Christodoulopoulos, Ryan Cotterell, Elia Bruni
Venues:
GenBench | WS
Publisher:
Association for Computational Linguistics
Pages:
12–24
URL:
https://aclanthology.org/2023.genbench-1.2
DOI:
10.18653/v1/2023.genbench-1.2
Cite (ACL):
Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, and Ansgar Scherp. 2023. GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding. In Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, pages 12–24, Singapore. Association for Computational Linguistics.
Cite (Informal):
GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding (Diera et al., GenBench-WS 2023)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2023.genbench-1.2.pdf
Video:
https://preview.aclanthology.org/emnlp-22-attachments/2023.genbench-1.2.mp4