Yuru Wang
2025
ReviewRL: Towards Automated Scientific Review with RL
Sihang Zeng | Kai Tian | Kaiyan Zhang | Yuru Wang | Junqi Gao | Runze Liu | Sa Yang | Jingxuan Li | Xinwei Long | Jiaheng Ma | Biqing Qi | Bowen Zhou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback that lacks the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released on GitHub.
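As a rough illustration of the composite reward the abstract describes, the Python sketch below blends a model-based review-quality score with a rating-accuracy term. This is a hypothetical rendering, not ReviewRL's released implementation; the weight `alpha`, the 1-10 rating scale, and the function names are all assumptions.

```python
# Hypothetical sketch of a composite reward of the kind the abstract
# describes; NOT ReviewRL's released code. Weights, scales, and names
# are illustrative assumptions.

def composite_reward(quality_score: float,
                     predicted_rating: float,
                     reference_rating: float,
                     alpha: float = 0.5) -> float:
    """Blend review quality with rating accuracy; both terms lie in [0, 1]."""
    # Rating accuracy: 1.0 for an exact match, decaying linearly with the
    # absolute error (ratings assumed on a 1-10 scale, so errors span 0-9).
    rating_accuracy = 1.0 - min(abs(predicted_rating - reference_rating) / 9.0, 1.0)
    # Joint objective: a single scalar reward the RL procedure can maximize.
    return alpha * quality_score + (1.0 - alpha) * rating_accuracy
```

With `alpha = 0.5` the two terms contribute equally; the paper's actual weighting and quality scorer are not given in the abstract.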
Scalability of LLM-Based Multi-Agent Systems for Scientific Code Generation: A Preliminary Study
Yuru Wang | Kaiyan Zhang | Kai Tian | Sihang Zeng | Xingtai Lv | Ning Ding | Biqing Qi | Bowen Zhou
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)
Recent studies indicate that LLM-based Multi-Agent Systems (MAS) encounter scalability challenges in complex mathematical problem-solving or coding tasks, exhibiting issues such as inconsistent role adherence and ineffective inter-agent communication. Moreover, the performance advantages of LLM-based MAS over a single agent employing test-time scaling methods (e.g., majority voting) remain marginal. This raises a critical question: Can LLM-based MAS scale effectively to achieve performance comparable to standalone LLMs, or even Large Reasoning Models (LRMs), under optimal test-time compute? In this paper, we conduct a preliminary investigation into the scalability of LLM-based MAS for scientific code generation. We propose a simple yet scalable two-player framework based on iterative critic-in-the-loop refinement. Our experiments demonstrate that a minimalist actor-critic framework based on DeepSeek-V3 can outperform DeepSeek-R1 under equivalent computational budgets. Surprisingly, more complex frameworks fail to yield significant gains. These findings corroborate recent insights into multi-agent system limitations and highlight the importance of scalable workflows for advancing scientific code generation.
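The two-player framework can be pictured as a loop in which an actor model drafts code and a critic model reviews it under a fixed round budget. The sketch below is a minimal, hypothetical rendering of such critic-in-the-loop refinement; `generate`, `critique`, and `passes_tests` are placeholder callables standing in for LLM calls and an execution harness, not functions from the paper.

```python
# Hypothetical sketch of iterative critic-in-the-loop refinement; all
# callables are placeholders, not the paper's implementation.
from typing import Callable

def refine(task: str,
           generate: Callable[..., str],
           critique: Callable[[str, str], str],
           passes_tests: Callable[[str], bool],
           max_rounds: int = 4) -> str:
    """Actor drafts code; critic reviews it; actor revises until the code
    passes tests or the round budget (a proxy for test-time compute) runs out."""
    code = generate(task)                                     # actor: initial draft
    for _ in range(max_rounds):
        if passes_tests(code):                                # execution harness as oracle
            return code
        feedback = critique(task, code)                       # critic: targeted review
        code = generate(task, prior=code, feedback=feedback)  # actor: revision
    return code                                               # best effort after budget
```

Capping `max_rounds` makes the budget-matched comparison against a single model explicit, which is the equal-compute setting the abstract evaluates.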