Zelai Yang
2026
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Yukang Feng | Jianwen Sun | Zelai Yang | Jiaxin Ai | Chuanhao Li | Zizhen Li | Fanrui Zhang | Kang He | Rui Ma | Jifan Lin | Jie Sun | Yang Xiao | Sizhuo Zhou | Wenxiao Wu | Yiming Liu | Pengfei Liu | Shenglin Zhang | Kaipeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic, sequential engineering tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. LongCLI-Bench employs a dual-set testing protocol, which measures requirement fulfillment (fail→pass) and regression avoidance (pass→pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results suggest that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents’ planning and execution capabilities to overcome key challenges in long-horizon task performance.
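To make the dual-set protocol and step-level scoring concrete, the following is a minimal Python sketch, not the benchmark's actual implementation. The abstract does not specify how scores are computed, so everything here is an assumption for illustration: the `TaskResult` record, the function names, the use of simple pass fractions, and the treatment of step-level completion as the share of steps passed.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of test outcomes around one agent run."""
    before: dict[str, bool]  # test name -> passed before the agent's changes
    after: dict[str, bool]   # test name -> passed after the agent's changes


def _pass_fraction(result: TaskResult, passed_before: bool) -> float:
    # Select tests by their status before the run, then measure how many pass after it.
    names = [name for name, ok in result.before.items() if ok == passed_before]
    if not names:
        return 1.0  # assumption: an empty test set counts as vacuously satisfied
    return sum(result.after.get(name, False) for name in names) / len(names)


def fail_to_pass(result: TaskResult) -> float:
    """Requirement fulfillment: share of initially failing tests that now pass."""
    return _pass_fraction(result, passed_before=False)


def pass_to_pass(result: TaskResult) -> float:
    """Regression avoidance: share of initially passing tests that still pass."""
    return _pass_fraction(result, passed_before=True)


def step_completion(step_passed: list[bool]) -> float:
    """Assumed step-level score: fraction of a task's steps that passed."""
    return sum(step_passed) / len(step_passed) if step_passed else 0.0


example = TaskResult(
    before={"test_new_feature": False, "test_new_cli_flag": False, "test_existing_io": True},
    after={"test_new_feature": True, "test_new_cli_flag": False, "test_existing_io": True},
)
f2p = fail_to_pass(example)
p2p = pass_to_pass(example)
completion = step_completion([True, False, False, False])
```

In this made-up example the agent satisfies one of two new requirements without breaking existing behavior, and it completes one of four steps, which would fall under the "less than 30% completion" range the abstract describes.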