OmniCode: A Benchmark for Evaluating Software Development Agents

Atharv Sonwane; Eng-Shen Tu; Wei-Chung Lu; Claas Beger; Carter Larsen; Debjit Dhar; Simon Alford; Rachel Chen; Ronit Pattanayak; Tuan Anh Dang; Guohao Chen; Gloria Geng; Kevin Ellis; Saikat Dutta

Abstract

LLM-powered coding agents are redefining how real-world software is developed. To drive research toward better coding agents, we need challenging benchmarks that rigorously evaluate the ability of such agents to perform a variety of software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In practice, software engineers handle a much broader set of tasks during real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that covers a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues. In building the benchmark, we also present a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate popular agent frameworks such as SWE-Agent on OmniCode and show that while they may perform well on bug fixing for Python, they fall short on tasks such as test generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 25.0% with DeepSeek-V3.1 on C++ test generation. OmniCode aims to serve as a robust benchmark and to spur the development of agents that perform well across different aspects of software development.

Anthology ID: 2026.findings-acl.2020
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 40634–40661
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2020/
Cite (ACL): Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, and Saikat Dutta. 2026. OmniCode: A Benchmark for Evaluating Software Development Agents. In Findings of the Association for Computational Linguistics: ACL 2026, pages 40634–40661, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): OmniCode: A Benchmark for Evaluating Software Development Agents (Sonwane et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2020.pdf
Checklist: 2026.findings-acl.2020.checklist.pdf
