@inproceedings{lindenbauer-etal-2025-gitgoodbench,
title = "{G}it{G}ood{B}ench: A Novel Benchmark For Evaluating Agentic Performance On Git",
author = "Lindenbauer, Tobias and
Bogomolov, Egor and
Zharov, Yaroslav",
editor = "Kamalloo, Ehsan and
Gontier, Nicolas and
Lu, Xing Han and
Dziri, Nouha and
Murty, Shikhar and
Lacoste, Alexandre",
booktitle = "Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/landing_page/2025.realm-1.19/",
pages = "272--288",
ISBN = "979-8-89176-264-0",
    abstract = "Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in the programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissively licensed open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving an overall solve rate of 21.11{\%}. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming."
}