Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories

Minyu Chen, Guoqiang Li, Chen Ma, Jingyang Li, Hongfei Fu


Abstract
Open-source platforms such as GitHub and Stack Overflow both play significant roles in current software ecosystems. It is crucial but time-consuming for developers to raise programming questions in coding forums such as Stack Overflow and then find actual solutions in GitHub repositories. In this paper, we aim to accelerate this process. We find that traditional information retrieval-based methods fail to handle the long and complex questions in coding forums, and thus cannot find suitable coding repositories. To effectively and efficiently bridge the semantic gap between repositories and real-world coding questions, we introduce a specialized dataset named Repo4QA, which includes over 12,000 question-repository pairs constructed from Stack Overflow and GitHub. Furthermore, we propose QuRep, a CodeBERT-based model that jointly learns the representation of both questions and repositories. Experimental results demonstrate that our model simultaneously captures the semantic features in both questions and repositories through supervised contrastive loss and hard negative sampling. We report that our approach outperforms existing state-of-the-art methods by 3%-8% on MRR and 5%-8% on P@1.
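For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how MRR (mean reciprocal rank) and P@1 (precision at rank 1) are commonly computed. It assumes each question has exactly one correct repository and that the 1-based rank of that repository in the retrieved list is already known. The function name and the example ranks are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence, Tuple


def mrr_and_p_at_1(ranks: Sequence[int]) -> Tuple[float, float]:
    """Compute MRR and P@1 from 1-based ranks of the correct item.

    ranks[i] is the position of the single correct repository in the
    ranked list returned for question i (assumption: one correct
    repository per question).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    # MRR: average of the reciprocal ranks over all questions.
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    # P@1: fraction of questions whose correct repository is ranked first.
    p_at_1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, p_at_1


# Made-up ranks for four illustrative questions (not real results).
example_ranks = [1, 2, 4, 1]
mrr, p1 = mrr_and_p_at_1(example_ranks)
print(f"MRR={mrr:.4f}  P@1={p1:.2f}")
```

With a single correct repository per question, P@1 is simply the share of questions answered correctly at the top position, while MRR also gives partial credit when the correct repository appears lower in the list.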
Anthology ID:
2022.coling-1.136
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1580–1592
URL:
https://aclanthology.org/2022.coling-1.136
Cite (ACL):
Minyu Chen, Guoqiang Li, Chen Ma, Jingyang Li, and Hongfei Fu. 2022. Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1580–1592, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories (Chen et al., COLING 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.coling-1.136.pdf
Code
 minkow/repo4qa
Data
CoSQA, CodeXGLUE