Yi Lu
2026
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
Zhuofeng Li | Yi Lu | Dongfu Jiang | Haoxiang Zhang | Yuyang Bai | Chuan Li | Yu Wang | Shuiwang Ji | Jianwen Xie | Yu Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce ReviewBench, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper’s content, and human-written reviews. We further propose ReviewGrounder, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on ReviewBench show that ReviewGrounder, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai | Zhiheng Lyu | Yuansheng Ni | Xiangchao Chen | Baichuan Zhou | Shenzhe Zhu | Yi Lu | Haozhe Wang | Chi Ruan | Benjamin Schneider | Weixu Zhang | Xiang Li | Andy Zheng | Yuyu Zhang | Ping Nie | Wenhu Chen
Findings of the Association for Computational Linguistics: ACL 2026
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.