Yaroslav Zharov


2025

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git
Tobias Lindenbauer | Egor Bogomolov | Yaroslav Zharov
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in the programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this gap, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissively licensed open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving an overall solve rate of 21.11%. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.
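To make the evaluation setup concrete, below is a minimal sketch of how one might score an agent over GitGoodBench-style samples. The file name, the sample fields, and the solve_scenario() hook are assumptions for illustration only; they are not the paper's actual harness or data format.

```python
# Minimal sketch of an evaluation loop over GitGoodBench-style samples.
# File name, field names, and solve_scenario() are hypothetical placeholders.
import json
from pathlib import Path

def solve_scenario(repo_url: str, scenario: str, task: str) -> bool:
    """Placeholder for an agent run: clone the repository, carry out the Git
    scenario (e.g. a rebase or merge-conflict resolution), and report whether
    the resulting repository state matches the expected one."""
    raise NotImplementedError

def evaluate(samples_path: Path) -> float:
    samples = [json.loads(line)
               for line in samples_path.read_text().splitlines() if line.strip()]
    solved = 0
    for sample in samples:
        try:
            if solve_scenario(sample["repository"], sample["scenario_type"], sample["task"]):
                solved += 1
        except Exception:
            pass  # an agent failure counts as an unsolved sample
    return solved / len(samples)

if __name__ == "__main__":
    rate = evaluate(Path("gitgoodbench_prototyping.jsonl"))
    print(f"Solve rate: {rate:.2%}")  # the paper's GPT-4o baseline reports 21.11%
```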

2024

Debug Smarter, Not Harder: AI Agents for Error Resolution in Computational Notebooks
Konstantin Grotov | Artem Borzilov | Maksim Krivobok | Timofey Bryksin | Yaroslav Zharov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Computational notebooks have become indispensable tools for research-related development, offering unprecedented interactivity and flexibility in the development process. However, these benefits come at the cost of reproducibility and an increased potential for bugs. With the rise of code-fluent Large Language Models empowered with agentic techniques, smart bug-fixing tools with a high level of autonomy have emerged. However, those tools are tuned for classical script programming and still struggle with non-linear computational notebooks. In this paper, we present an AI agent designed specifically for error resolution in a computational notebook. We have developed an agentic system capable of exploring a notebook environment by interacting with it, similar to how a user would, and integrated the system into Datalore, the JetBrains service for collaborative data science. We evaluate our approach against the pre-existing single-action solution by comparing costs and conducting a user study. Users rate the error resolution capabilities of the agentic system higher but experience difficulties with the UI. We share the results of the study and consider them valuable for further improving user-agent collaboration.
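As a rough illustration of the interaction pattern described above, the sketch below shows an agent loop that executes notebook cells one by one and, when a cell fails, asks a model for a patched version and retries. The NotebookEnv class and the propose_fix() hook are hypothetical stand-ins; the actual Datalore integration and tool set are not detailed in the abstract.

```python
# Minimal sketch of an agentic error-resolution loop for a computational notebook.
# NotebookEnv and propose_fix() are hypothetical placeholders, not the paper's system.
from dataclasses import dataclass, field

@dataclass
class NotebookEnv:
    """Toy stand-in for a notebook the agent can inspect and execute."""
    cells: list[str] = field(default_factory=list)

    def run_cell(self, index: int) -> tuple[bool, str]:
        try:
            # Each call uses a fresh namespace here; a real notebook shares state across cells.
            exec(self.cells[index], {})
            return True, ""
        except Exception as exc:
            return False, f"{type(exc).__name__}: {exc}"

def propose_fix(cell_source: str, error: str) -> str:
    """Placeholder for an LLM call that rewrites a failing cell given its error message."""
    raise NotImplementedError

def resolve_errors(env: NotebookEnv, max_attempts: int = 5) -> None:
    # The agent explores the notebook like a user would: run cells in order,
    # and when one fails, request a patched version and re-run it.
    for i in range(len(env.cells)):
        for _ in range(max_attempts):
            ok, error = env.run_cell(i)
            if ok:
                break
            env.cells[i] = propose_fix(env.cells[i], error)
```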