FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li
Abstract
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts of the code repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
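As a rough illustration of the verification setup described in the abstract, the sketch below applies a candidate patch at a task instance's base commit and accepts it only if the paired unit tests pass. The schema fields (`repo`, `base_commit`, `model_patch`, `test_files`) and the use of `git apply` with `pytest` are assumptions made for illustration, not FEA-Bench's actual harness.

```python
import subprocess
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskInstance:
    """One FEA-Bench-style task: a feature-adding pull request plus its tests (illustrative schema)."""
    repo: str                 # GitHub repository, e.g. "owner/project" (hypothetical field)
    base_commit: str          # commit the pull request branched from (hypothetical field)
    model_patch: str          # unified diff proposed by the model (hypothetical field)
    test_files: List[str] = field(default_factory=list)  # paired unit test files (hypothetical field)


def verify(instance: TaskInstance, workdir: str) -> bool:
    """Apply the model's patch at the base commit and run the paired unit tests."""
    # Check out the state of the repository before the feature was implemented.
    subprocess.run(["git", "checkout", instance.base_commit], cwd=workdir, check=True)

    # Apply the model-generated diff; a patch that does not apply cleanly fails immediately.
    applied = subprocess.run(
        ["git", "apply", "-"], input=instance.model_patch, text=True, cwd=workdir
    )
    if applied.returncode != 0:
        return False

    # The solution counts as correct only if all paired unit tests pass.
    tests = subprocess.run(["python", "-m", "pytest", *instance.test_files], cwd=workdir)
    return tests.returncode == 0
```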
- Anthology ID: 2025.acl-long.839
- Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 17160–17176
- URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.839/
- Cite (ACL): Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. 2025. FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17160–17176, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation (Li et al., ACL 2025)
- PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.839.pdf