Junhao Wang


2025

DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang | Junhao Wang | Shilin He | Chaoyun Zhang | Yu Kang | Bowen Li | Jiaheng Wen | Chengxing Xie | Maoquan Wang | Yufan Huang | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models have advanced automated software development; however, correctly inferring dependencies, namely identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability in dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
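To make the task concrete: below is a minimal sketch, not taken from the paper, of what dependency inference amounts to for Python code. It collects the top-level modules a source file imports, filters out the standard library, and reports any that a hypothetical declared-dependency set (e.g., parsed from requirements.txt) does not cover. The sample source, the declared set, and the function names are all illustrative assumptions; the paper's benchmark evaluates this at repository scale with execution-based checks.

```python
# Illustrative sketch of Python dependency inference (not the paper's method).
# Caveat: import names and PyPI package names can differ (e.g., PIL vs. pillow);
# this simplification maps them one-to-one.
import ast
import sys


def imported_top_level_modules(source: str) -> set[str]:
    """Collect top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 excludes relative imports, which are internal components.
            modules.add(node.module.split(".")[0])
    return modules


def missing_dependencies(source: str, declared: set[str]) -> set[str]:
    """External imports not covered by the declared dependency set."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())  # Python 3.10+
    return {m for m in imported_top_level_modules(source)
            if m not in stdlib and m not in declared}


if __name__ == "__main__":
    sample = "import numpy as np\nimport json\nfrom requests import get\n"
    declared = {"numpy"}  # hypothetical, e.g., parsed from requirements.txt
    print(missing_dependencies(sample, declared))  # -> {'requests'}
```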