2025
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang | Junhao Wang | Shilin He | Chaoyun Zhang | Yu Kang | Bowen Li | Jiaheng Wen | Chengxing Xie | Maoquan Wang | Yufan Huang | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models have advanced automated software development; however, correctly inferring dependencies, namely identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability in dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
UFO: A UI-Focused Agent for Windows OS Interaction
Chaoyun Zhang | Liqun Li | Shilin He | Xu Zhang | Bo Qiao | Si Qin | Minghua Ma | Yu Kang | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We introduce UFO, a UI-Focused agent designed to fulfill user requests tailored to Windows OS applications by observing and analyzing the GUI and control information of these applications. UFO utilizes a hierarchical dual-agent framework that decomposes user requests using a divide-and-conquer approach, enabling seamless navigation and addressing sub-tasks across multiple applications. It also incorporates a control interaction module tailored for Windows OS, which detects control elements effectively and allows for fully automated execution. As a result, UFO simplifies complex and time-consuming processes into tasks that can be completed with natural language commands. We tested UFO across 9 popular Windows applications, encompassing a variety of scenarios. The results, derived from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO stands as the first UI agent specifically tailored for task completion within the Windows OS.
2024
QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback based Self-Correction
Xiang Huang | Sitao Cheng | Shanshan Huang | Jiayu Shen | Yong Xu | Chaoyun Zhang | Yuzhong Qu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Employing Large Language Models (LLMs) for semantic parsing has achieved remarkable success. However, we find existing methods fall short in terms of reliability and efficiency when hallucinations are encountered. In this paper, we address these challenges with a framework called QueryAgent, which solves a question step-by-step and performs stepwise self-correction. We introduce an environmental feedback-based self-correction method called ERASER. Unlike traditional approaches, ERASER leverages rich environmental feedback in the intermediate steps to perform selective and differentiated self-correction only when necessary. Experimental results demonstrate that QueryAgent, using only one example, notably outperforms all previous few-shot methods on GrailQA and GraphQ by 5.7 and 15.0 points, respectively. Furthermore, our approach exhibits superiority in terms of efficiency, including run-time, query overhead, and API invocation costs. By leveraging ERASER, we further improve another baseline (i.e., AgentBench) by approximately 10 points, validating the strong transferability of our approach.
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
Ruomeng Ding | Chaoyun Zhang | Lu Wang | Yong Xu | Minghua Ma | Wei Zhang | Si Qin | Saravan Rajmohan | Qingwei Lin | Dongmei Zhang
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces a novel thought prompting approach called “Everything of Thoughts” (XoT) for Large Language Models (LLMs) to defy the law of the “Penrose triangle” of existing thought paradigms and achieve three key perspectives in thought generation simultaneously: performance, efficiency, and flexibility. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge and planning capability into thoughts, thereby enhancing LLMs’ decision-making capabilities. Through the MCTS-LLM collaborative thought revision framework, XoT autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to utilize flexible cognitive mappings for solving problems with multiple solutions. We evaluate XoT on several challenging problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube. Our results demonstrate that XoT significantly outperforms existing approaches in various dimensions, showcasing its remarkable proficiency in addressing complex problems across diverse domains. The data and code are available at https://github.com/microsoft/Everything-of-Thoughts-XoT.
Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments
Sitao Cheng | Ziyuan Zhuang | Yong Xu | Fangkai Yang | Chaoyun Zhang | Xiaoting Qin | Xiang Huang | Ling Chen | Qingwei Lin | Dongmei Zhang | Saravan Rajmohan | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2024
Large Language Models (LLMs) have shown potential in reasoning over structured environments, e.g., knowledge graphs and tables. Such tasks typically require multi-hop reasoning, i.e., matching natural language utterances with instances in the environment. Previous works adopt LLMs to incrementally build a reasoning path, where LLMs either invoke tools or pick up items by interacting with the environment step by step. We propose Reasoning-Path-Editing (Readi), a novel framework where LLMs can efficiently and faithfully reason over structured environments. In Readi, LLMs initially generate a reasoning path given a query, and edit the path only when necessary. We instantiate the path on structured environments and provide feedback to edit the path if anything goes wrong. Experimental results on three KGQA and two TableQA datasets show the effectiveness of Readi, significantly surpassing previous LLM-based methods (by 9.1% Hit@1 on WebQSP, 12.4% on MQA-3H and 9.5% on WTQ), comparable with state-of-the-art fine-tuned methods (67% on CWQ and 74.7% on WebQSP) and substantially boosting the vanilla LLMs (by 14.9% on CWQ). Our code will be available at https://aka.ms/readi.