Multi-Docker-Eval: A ‘Shovel of the Gold Rush’ Benchmark on Automatic Environment Building for Software Engineering
Kelin Fu, Tianyu Liu, Zeyu Shang, Yingwei MA, Jiaheng Liu, Jian Yang, Kaigui Bian
Abstract
Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in achieving executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models like DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) the agent framework and programming language also have a significant influence on success rate. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
- Anthology ID:
- 2026.findings-acl.889
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17911–17927
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.889/
- Cite (ACL):
- Kelin Fu, Tianyu Liu, Zeyu Shang, Yingwei MA, Jiaheng Liu, Jian Yang, and Kaigui Bian. 2026. Multi-Docker-Eval: A ‘Shovel of the Gold Rush’ Benchmark on Automatic Environment Building for Software Engineering. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17911–17927, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Multi-Docker-Eval: A ‘Shovel of the Gold Rush’ Benchmark on Automatic Environment Building for Software Engineering (Fu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.889.pdf