Xianjie Shi


2026

Large language models (LLMs) excel at general programming but struggle with domain-specific software development. This gap motivates research into domain specialization methods that enable LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO-bench, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-bench contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks including domain code generation (from function-level to project-level with rigorous test suites) and domain knowledge understanding (via multiple-choice Q A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-bench requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from the corpora to solve evaluation tasks. Our evaluations reveal that KOCO-bench poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. Best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-bench, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO-bench.

2024

Large Language Models (LLMs) have shown promise in automated code generation but typically excel only in simpler tasks such as generating standalone code units. However, real-world software development often involves complex code repositories with complex dependencies and extensive documentation. To enable LLMs to handle these realworld repo-level code generation, we present CodeAgent, a novel LLM-based agent framework that employs external tools for effective repo-level code generation. CodeAgent integrates five programming tools, enabling interaction with software artifacts for information retrieval, code implementation, and code testing. We implement four agent strategies to optimize these tools’ usage. To the best of our knowledge, CodeAgent is the first agent tool framework specifically for repo-level code generation. In order to measure the effectiveness of our method at the repository level, we have introduced a benchmark dataset CodAgentBench. The performance on this dataset shows a significant improvement brought by our method, with improvements of pass rate ranging from 2.0 to 15.8. Further tests on the HumanEval benchmark confirm CodeAgent’s adaptability and efficacy across various code generation tasks. Notably, CodeAgent outperforms commercial products like Github Copilot, showcasing superior accuracy and efficiency. These results demonstrate CodeAgent’s robust capabilities in code generation, highlighting its potential for real-world repo-level coding challenges.