Zora Zhiruo Wang
While language models (LMs) excel at generating code, many programs are difficult to generate using only parametric knowledge. Despite the success of retrieval-augmented generation (RAG) in text-centric tasks, its potential for code generation remains under-explored. This work introduces CodeRAG-bench, a holistic benchmark for retrieval-augmented code generation that covers basic programming, open-domain, and repository-level problems, and provides reproducible evaluations of both retrieval and end-to-end code generation performance. We further create a diverse, open datastore for code retrieval, aggregating sources such as competition solutions, tutorials, library documentation, StackOverflow posts, and GitHub repositories. Based on CodeRAG-bench, we conduct large-scale evaluations of 10 retrievers and 10 LMs, systematically analyze when retrieval benefits code generation models, and identify remaining challenges. We find that while retrieving high-quality contexts improves code generation, retrievers often struggle to fetch useful contexts, and generators face limitations in using those contexts effectively. We hope CodeRAG-bench encourages further development in code-oriented RAG methods.
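A minimal sketch may make the retrieve-then-generate setup that CodeRAG-bench evaluates concrete. The bag-of-words retriever and the `generate` stub below are simplifying assumptions for illustration, not components of the benchmark itself:

```python
# Sketch of a retrieval-augmented code generation pipeline: retrieve the
# top-k documents for a task, prepend them to the prompt, then generate.
# The retriever and the model call are stand-ins, not CodeRAG-bench's own.
from collections import Counter
import math

def score(query: str, doc: str) -> float:
    """Cosine similarity over bag-of-words counts (stand-in retriever)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return overlap / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Stand-in for a code LM call; replace with an actual model."""
    return "# model output would go here"

def rag_codegen(task: str, corpus: list[str]) -> str:
    """Condition generation on retrieved contexts plus the task."""
    contexts = retrieve(task, corpus)
    return generate("\n\n".join(contexts) + "\n\n# Task:\n" + task)
```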
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking—the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (cAST), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
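As a rough illustration of the chunking idea (not the paper's implementation, which handles multiple programming languages; this sketch assumes Python source, the built-in `ast` module, and a character-count size budget):

```python
# Hedged sketch of AST-based chunking in the spirit of cAST: recursively
# split oversized AST nodes and greedily merge small sibling nodes under
# a size budget, so chunks align with syntactic boundaries.
import ast

MAX_CHARS = 500  # assumed chunk-size budget

def chunk_ast(source: str, node: ast.AST | None = None) -> list[str]:
    node = node or ast.parse(source)
    chunks: list[str] = []
    buffer = ""  # accumulates merged sibling nodes
    for child in ast.iter_child_nodes(node):
        text = ast.get_source_segment(source, child) or ""
        if not text:  # skip nodes without source locations
            continue
        if len(text) > MAX_CHARS:
            # Oversized node: flush pending siblings, then recurse into it.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(chunk_ast(source, child))
        elif len(buffer) + len(text) > MAX_CHARS:
            # Budget exceeded: emit the merged siblings, start a new chunk.
            chunks.append(buffer)
            buffer = text
        else:
            buffer = buffer + "\n" + text if buffer else text
    if buffer:
        chunks.append(buffer)
    return chunks
```

Unlike fixed line windows, chunk_ast(source) keeps whole functions and classes together whenever they fit within the budget, rather than cutting at arbitrary line counts.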
The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume ‘perfect’ information access and tool availability, which may not hold in the real world. To systematically study TaLMs’ imperfections, we introduce the FAIL-TaLMs benchmark, featuring two major failure modes: under-specified user queries and non-available tools. FAIL-TaLMs contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. Further, to study possible mitigation of these failures, we enable real-time human interaction, which we name the Ask-and-Help method, to provide missing information or replace non-functional tools. While Ask-and-Help helps models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.
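To make the mitigation concrete, a hedged sketch of an Ask-and-Help-style loop follows; the plan format, tool registry, and prompts are illustrative assumptions, not the benchmark's actual interface:

```python
# Illustrative Ask-and-Help-style loop: defer to a human when a tool is
# unavailable or a required argument is missing, instead of guessing.
# Plan format and tool registry are assumptions made for this sketch.
def run_with_ask_and_help(plan, tools, required_args):
    """plan: (tool_name, args) steps; tools: name -> callable registry."""
    results = []
    for tool_name, args in plan:
        while tool_name not in tools:
            # Non-available tool: ask the human for a working substitute.
            tool_name = input(f"Tool '{tool_name}' is unavailable; name a substitute: ")
        for arg in required_args.get(tool_name, []):
            if arg not in args:
                # Under-specified query: ask the human for the missing value.
                args[arg] = input(f"'{tool_name}' needs a value for '{arg}': ")
        results.append(tools[tool_name](**args))
    return results
```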
While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and at modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent’s capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent’s by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research on how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
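A hedged sketch of the collaboration loop described above; the `propose` and `execute` interfaces and the text prompts are assumed for illustration and are not CowPilot's actual API:

```python
# Sketch of a CowPilot-style loop: the agent proposes each action and the
# human may accept it, override it, or stop; either party's action runs,
# and the agent proposes its next step from the resulting page state.
def collaborative_session(agent, browser, max_steps: int = 30):
    for _ in range(max_steps):
        action = agent.propose(browser.state())  # agent suggests next step
        choice = input(f"Agent proposes: {action} [a]ccept / [o]verride / [q]uit: ")
        if choice == "q":        # human pauses or ends the episode
            break
        if choice == "o":        # human interleaves their own action
            action = input("Your action: ")
        browser.execute(action)  # either party's action is executed
```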