Addressing Leakage in Self-Supervised Contextualized Code Retrieval
Johannes Villmow, Viola Campos, Adrian Ulges, Ulrich Schwanecke
Abstract
We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for the direct evaluation of contextualized code retrieval, based on manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially and yields new state-of-the-art results for code clone and defect detection.
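Since the abstract compresses the method into a single sentence, a minimal sketch may help make the three anti-leakage measures concrete. The Python fragment below is an illustrative assumption, not the authors' implementation: the function names, the `<GAP>`/`<MASK>` tokens, and the regex-based identifier matching are all hypothetical, and a real pipeline would use a proper lexer (to skip keywords) plus a parser to pick syntax-aligned target spans.

```python
import re
import textwrap

# Hypothetical identifier pattern; it does not distinguish language
# keywords from user-defined names, which a real lexer would.
IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")

def split_context_target(source: str, start: int, end: int):
    """Cut lines [start, end) out of `source` as the retrieval target,
    leaving a gap marker in the remaining context. The paper selects
    syntax-aligned targets; here the span is simply taken as given."""
    lines = source.splitlines()
    target = "\n".join(lines[start:end])
    context = "\n".join(lines[:start] + ["<GAP>"] + lines[end:])
    return context, target

def mask_shared_identifiers(context: str, target: str, mask: str = "<MASK>"):
    """Mutual identifier masking: identifiers occurring in *both* views
    are replaced in each, so a model cannot match them trivially."""
    shared = set(IDENT.findall(context)) & set(IDENT.findall(target))
    repl = lambda m: mask if m.group(0) in shared else m.group(0)
    return IDENT.sub(repl, context), IDENT.sub(repl, target)

def dedent_target(target: str) -> str:
    """Dedentation: strip the common leading indentation so the target's
    nesting depth does not hint at where it was cut from."""
    return textwrap.dedent(target)
```

A plausible usage would chain the three steps, e.g. `ctx, tgt = split_context_target(src, 10, 18)` followed by `mask_shared_identifiers(ctx, tgt)` and `dedent_target(tgt)`, producing the two views that a contrastive objective pulls together.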
- Anthology ID:
- 2022.coling-1.84
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Editors:
- Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 1006–1013
- URL:
- https://aclanthology.org/2022.coling-1.84
- Cite (ACL):
- Johannes Villmow, Viola Campos, Adrian Ulges, and Ulrich Schwanecke. 2022. Addressing Leakage in Self-Supervised Contextualized Code Retrieval. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1006–1013, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- Addressing Leakage in Self-Supervised Contextualized Code Retrieval (Villmow et al., COLING 2022)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2022.coling-1.84.pdf
- Data
- CodeXGLUE