Junhao Wang
2026
When 20 Agents Fail to Sort: The Distributed Sorting Benchmark for Scalable Multi-Agent Systems
Xin Yang | Junhao Wang | Bintao Tang | Xuxin Cheng | Cao Liu | Ke Zeng | Wenyuan Jiang
Findings of the Association for Computational Linguistics: ACL 2026
Current LLM-based multi-agent systems remain fragile under scaling, even on algorithmically trivial tasks. We introduce MAS-BENCH, a distributed-sorting benchmark that isolates coordination under explicit communication constraints: each agent observes only a local segment and must collectively produce a globally consistent order via broadcasting, peer-to-peer messaging, or a shared key-value store. Across LLM-based agents, success drops sharply as the number of agents grows, exposing persistent failures in shared state, convention alignment, and consistent termination. To mitigate these breakdowns, we propose CAMOC, a lightweight, drop-in proof-of-concept built on collaboration-aware information sharing, early global metadata exchange, and single-commit verification. CAMOC substantially improves coordination success and efficiency across backends, with the largest gains under shared-state interaction. Overall, MAS-BENCH provides a diagnostic benchmark and CAMOC offers a practical step toward more reliable large-scale LLM collaboration, highlighting a gap between individual reasoning and collective correctness.
2025
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Xing Zhang | Jiaheng Wen | Fangkai Yang | Yu Kang | Pu Zhao | Junhao Wang | Maoquan Wang | Yufan Huang | Shengyu Fu | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Code translation benchmarks are essential for evaluating the accuracy and efficiency of LLM-based systems. Existing benchmarks mainly target individual functions, overlooking repository-level challenges like intermodule coherence and dependency management. Recent repository-level efforts exist, but suffer from poor maintainability and coarse evaluation granularity. We introduce Skeleton-Guided-Translation, a framework for benchmarking Java-to-C# translation at the repository level, featuring fine-grained quality evaluation. It follows a two-step process: first translating repository “skeletons”, then refining the entire repository guided by these skeletons. Based on this, we present TRANSREPO-BENCH, the first test-driven benchmark of high-quality Java repositories paired with C# skeletons, unit tests, and build configurations. Our adaptive unit tests support multiple and incremental translations without manual tuning, enhancing automation and scalability. We also propose fine-grained metrics that evaluate translation quality per test case, overcoming limitations of binary metrics in distinguishing build failures. Evaluations using TRANSREPO-BENCH reveal issues like broken cross-file references, showing that our structured approach reduces dependency errors and preserves interface consistency.
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang | Junhao Wang | Shilin He | Chaoyun Zhang | Yu Kang | Bowen Li | Jiaheng Wen | Chengxing Xie | Maoquan Wang | Yufan Huang | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability for dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.