Bo Shen
2025
CodeV: Issue Resolving with Visual Data
Linhao Zhang | Daoguang Zan | Quanshun Yang | Zhirong Huang | Dong Chen | Bo Shen | Tianyu Liu | Yongshun Gong | Huang Pengjie | Xudong Lu | Guangtai Liang | Lizhen Cui | Qianxiang Wang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have advanced rapidly in recent years, with their applications in software engineering expanding to more complex repository-level tasks. GitHub issue resolving is a key challenge among these tasks. While recent approaches have made progress on this task, they focus on the textual data within issues and neglect visual data. Yet this visual data is crucial for resolving issues, as it conveys additional knowledge that text alone cannot. We propose CodeV, the first approach to leverage visual data to enhance the issue-resolving capabilities of LLMs. CodeV resolves each issue through a two-phase process: data processing and patch generation. To evaluate CodeV, we construct a benchmark for visual issue resolving, namely Visual SWE-bench. Through extensive experiments, we demonstrate the effectiveness of CodeV and provide valuable insights into leveraging visual data to resolve GitHub issues.
2024
CodeM: Less Data Yields More Versatility via Ability Matrix
Daoguang Zan | Ailun Yu | Wei Liu | Bo Shen | Shaoxin Lin | Yongshun Gong | Yafen Yao | Yan Liu | Bei Guan | Weihua Luo | Yongji Wang | Qianxiang Wang | Lizhen Cui
Findings of the Association for Computational Linguistics: ACL 2024
In the era of code large language models (code LLMs), data engineering plays a pivotal role during the instruction fine-tuning phase. To train a versatile model, previous work devotes tremendous effort to crafting instruction data covering all downstream scenarios, which incurs significant expense in data construction and model training. This paper therefore introduces CodeM, a novel data construction strategy that can efficiently train a versatile model using less data via our newly proposed ability matrix. CodeM uses the ability matrix to decouple code LLMs' abilities into two dimensions, constructing a lightweight training corpus that covers only a subset of target scenarios. Extensive experiments on HumanEvalPack and MultiPL-E imply that code LLMs can combine single-dimensional abilities to master composed abilities, validating the effectiveness of CodeM.