Lei Bai
2026
A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement
Shengji Tang | Jianjian Cao | Weihao Lin | Jiale Hong | Bo Zhang | Shuyue Hu | Lei Bai | Tao Chen | Wanli Ouyang | Peng Ye
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration–Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1 (**+5.36%**) and GPT-o3-mini (**+5.28%**), across multiple tasks. Remarkably, it even exceeds the average of best results on different datasets with open-source LLMs (**+2.86%**), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.
MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs
Xiangyu Zhao | Wanghan Xu | Bo Liu | Yuhao Zhou | Fenghua Ling | Ben Fei | Xiaoyu Yue | Lei Bai | Wenlong Zhang | Xiao-Ming Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of multimodal large language models (MLLMs) offers new opportunities for complex scientific challenges, yet their application in earth science—especially at the graduate level—remains underexplored due to a lack of benchmarks reflecting the depth and complexity of geoscientific reasoning. Existing datasets often rely on synthetic data or simple figure-caption pairs, failing to capture the nuanced reasoning required for real-world applications. To address this, we introduce MSEarth, a multimodal scientific dataset and benchmark curated from high-quality, open-access publications. Covering the five major spheres of Earth science—atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere—MSEarth features over 289K figures with refined captions enriched by contextual discussions and reasoning from the original papers. The benchmark supports tasks such as scientific figure captioning, multiple choice questions, and open-ended reasoning, providing a scalable, high-fidelity resource for developing and evaluating MLLMs in scientific reasoning.
Nature-Inspired Population-Based Evolution of Large Language Models
Yiqun Zhang | Peng Ye | Xiaocui Yang | Shi Feng | Shufei Zhang | Lei Bai | Wanli Ouyang | Shuyue Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evolution, the engine behind the survival and growth of life on Earth, operates through the population-based process of reproduction. Inspired by this principle, this paper formally defines a newly emerging problem: the population-based evolution of large language models (LLMs). We introduce a novel framework that starts with a population of parent LLMs and allows this population to evolve through four key operations: (i) crossover, merging the weights of different parents to create offspring LLMs, (ii) mutation, introducing small, random changes to model weights to foster diversity, (iii) selection, prioritizing high-performing models, and (iv) succession, transferring the learned experience from parent to offspring LLMs. With only 200 samples per new task, the LLM population evolves rapidly to adapt to the task at hand, without any gradients. Experiments on 12 datasets show that our framework consistently outperforms existing multi-LLM merging and adaptation methods, achieving relative performance gains of up to 54.8% over the best LLM in the initial population. Moreover, our framework allows for (i) the evolution of LLMs across multiple new tasks simultaneously, (ii) scaling effectively with populations of up to 40 LLMs, and (iii) even zero-shot generalization to unseen held-out tasks. Code: https://github.com/ZhangYiqun018/GENOME
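The crossover, mutation, and selection operations named in this abstract can be illustrated with a toy gradient-free loop over plain weight vectors. This is a minimal sketch under stated assumptions, not the GENOME implementation: linear interpolation as the merge rule, Gaussian noise as the mutation, elitist selection, and the names `crossover`, `mutate`, `evolve`, `TARGET`, and `toy_fitness` are all hypothetical choices for illustration, and succession is not modeled.

```python
import random

def crossover(parent_a, parent_b, alpha):
    """Merge two weight vectors by linear interpolation (one simple, assumed merge rule)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(parent_a, parent_b)]

def mutate(weights, sigma, rng):
    """Add small Gaussian noise to every weight to keep the population diverse."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def evolve(population, fitness, generations=20, sigma=0.05, seed=0):
    """Gradient-free loop: crossover and mutation create offspring, selection keeps the best."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        offspring = []
        for _ in range(size):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng.random())
            offspring.append(mutate(child, sigma, rng))
        # Elitist selection: parents compete with offspring and the top `size` survive.
        population = sorted(population + offspring, key=fitness, reverse=True)[:size]
    return population

# Toy "task" (hypothetical): fitness is the negative squared distance to a target vector.
TARGET = [0.5, -1.0, 2.0]

def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

init_rng = random.Random(1)
initial = [[init_rng.uniform(-3, 3) for _ in TARGET] for _ in range(6)]
final = evolve(initial, toy_fitness)
```

Because parents compete with their offspring during selection, the best fitness in the population can never decrease from one generation to the next; only function evaluations are used, never gradients.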
R3: End-to-End Reasoning-based Planning for Multi-step Retrosynthesis via Reinforcement Learning
YiFei Wang | Qizhi Pei | Jiangtao Feng | Yuntian Shi | Yi Duan | Lihao Wang | Lei Bai | Lijun Wu | Wei-Ying Ma | Hao Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step retrosynthetic planning is a fundamental challenge in organic chemistry, traditionally modeled as a combinatorial search problem guided by single-step prediction models. However, this search-centric paradigm often disconnects from the explicit chemical reasoning processes employed by human experts. In this paper, we propose R3 (Reinforced Reasoning Retrosynthesis), a novel framework that reformulates this task as end-to-end generative reasoning. Instead of traversing a search tree, R3 simulates the problem-solving logic of chemists to directly generate complete synthetic pathways. To achieve this, we initialize the model with domain knowledge and employ end-to-end Reinforcement Learning (RL) to optimize the entire planning policy. Experimental results on Retrobench show that R3 achieves a state-of-the-art Top-1 accuracy of 43.7%, demonstrating that generative reasoning offers a superior alternative to traditional search algorithms in solving complex retrosynthetic problems.
MTRouter: Cost-Aware Multi-Turn LLM Routing with History–Model Joint Embeddings
Yiqun Zhang | Hao Li | Zihan Wang | Shi Feng | Xiaocui Yang | Daling Wang | Bo Zhang | Lei Bai | Shuyue Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history–model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance–cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity’s Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
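The routing problem stated in this abstract (pick one model per turn from a pool, subject to a fixed cost budget) can be sketched as a greedy per-turn rule. This is an illustrative sketch, not MTRouter itself: the learned outcome estimator is replaced by a hand-written lookup table, and `Model`, `route_turn`, `run_episode`, `cost_weight`, and all numbers are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    cost_per_call: float

def route_turn(models, predicted_utility, remaining_budget, cost_weight=1.0):
    """Pick the affordable model with the best utility-minus-cost score, or None."""
    affordable = [m for m in models if m.cost_per_call <= remaining_budget]
    if not affordable:
        return None
    return max(
        affordable,
        key=lambda m: predicted_utility[m.name] - cost_weight * m.cost_per_call,
    )

def run_episode(models, utilities_per_turn, budget, cost_weight=1.0):
    """Route every turn in order, stopping early if no model fits the remaining budget."""
    choices = []
    remaining = budget
    for predicted_utility in utilities_per_turn:
        model = route_turn(models, predicted_utility, remaining, cost_weight)
        if model is None:
            break
        choices.append(model.name)
        remaining -= model.cost_per_call
    return choices, remaining

# Hypothetical pool and per-turn utility predictions (stand-ins for a learned estimator).
MODELS = [Model("small", 1.0), Model("large", 5.0)]
UTILITIES = [
    {"small": 0.9, "large": 1.0},  # easy turn: the cheap model is nearly as good
    {"small": 0.2, "large": 0.9},  # hard turn: the expensive model is worth its cost
    {"small": 0.3, "large": 0.9},  # hard turn, but only the cheap model is affordable
]
choices, remaining = run_episode(MODELS, UTILITIES, budget=7.0, cost_weight=0.1)
print(choices, remaining)  # ['small', 'large', 'small'] 0.0
```

The third turn shows the budget constraint at work: the larger model has the higher score, but it no longer fits the remaining budget, so the router falls back to the cheaper one.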
FlowSearch: Advancing Deep Research with Dynamic Structured Knowledge Flow
Yusong Hu | Runmin Ma | Yue Fan | Jinxin Shi | Zongsheng Cao | Yuhao Zhou | Jiakang Yuan | Shuaiyu Zhang | Shiyang Feng | Xiangchao Yan | Shufei Zhang | Wenlong Zhang | Lei Bai | Bo Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Deep research is an inherently challenging task that demands both breadth and depth of thinking. It involves navigating diverse knowledge spaces and reasoning over complex, multi-step dependencies, which presents substantial challenges for agentic systems. To address this, we propose FlowSearch, a multi-agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning. FlowSearch is capable of strategically planning and expanding the knowledge flow to enable parallel exploration and hierarchical task decomposition, while also adjusting the knowledge flow in real time based on feedback from intermediate reasoning outcomes and insights. FlowSearch achieves competitive performance on both general and scientific benchmarks, including GAIA, HLE, GPQA and TRQA, demonstrating its effectiveness in multi-disciplinary research scenarios and its potential to advance scientific discovery. The code will be available.
Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
Zhiyin Yu | Bo Zhang | Qibin Hou | Zhonghai Wu | Xiao Luo | Lei Bai
Findings of the Association for Computational Linguistics: ACL 2026
Previous LLM-based RL studies typically follow either supervised learning with high annotation costs, or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial annotation cost and issues such as model collapse or reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with few-shot labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, combining consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model’s reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
Hao Li | Yiqun Zhang | Zhaoyan Guo | Chenxu Wang | Shengji Tang | Qiaosheng Zhang | Yang Chen | Biqing Qi | Peng Ye | Lei Bai | Zhen Wang | Shuyue Hu
Findings of the Association for Computational Linguistics: ACL 2026
Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity—the central premise of LLM routing—we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.
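The notions of model complementarity and the gap to the Oracle mentioned in this abstract can be made concrete with a small calculation. The sketch below is an illustration on made-up data, not the benchmark's metric code: `best_single_accuracy`, `oracle_accuracy`, and the `CORRECT` table are hypothetical, and the Oracle is assumed to be a router that always picks a model that answers correctly whenever one exists.

```python
def best_single_accuracy(correct):
    """Accuracy of the single best model when it is used for every query."""
    return max(sum(flags) / len(flags) for flags in correct.values())

def oracle_accuracy(correct):
    """Accuracy of an ideal router: a query counts if any model in the pool solves it."""
    per_query = list(zip(*correct.values()))
    return sum(any(flags) for flags in per_query) / len(per_query)

# Hypothetical correctness table: one row per model, one column per query.
CORRECT = {
    "model_a": [True, True, False, False],
    "model_b": [False, True, True, False],
}

single = best_single_accuracy(CORRECT)  # 0.5
oracle = oracle_accuracy(CORRECT)       # 0.75
print(single, oracle)
```

The difference between the two numbers is the headroom a router could capture in principle; a practical router lands somewhere between them.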
A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
Zhiyin Yu | Yuchen Mou | Juncheng Yan | Junyu Luo | Chunchun Chen | Xing Wei | Yunhui Liu | Hongru Sun | Yuxing Zhang | Jun Xu | Yatao Bian | Ming Zhang | Wei Ye | Tieke He | Jie Yang | Guanjie Zheng | Zhonghai Wu | Bo Zhang | Lei Bai | Xiao Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL) has emerged as a powerful post-training paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, reinforcement learning for LLMs faces substantial data scarcity challenges, including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. These limitations make data-efficient reinforcement learning a critical research direction. In this survey, we present the first systematic review of reinforcement learning for LLMs under data scarcity. We propose a bottom-up hierarchical framework built around three complementary perspectives: the data-centric perspective, the training-centric perspective, and the framework-centric perspective. We develop a taxonomy of existing methods, summarize representative approaches in each category, and analyze their strengths and limitations. Our taxonomy aims to provide a clear conceptual foundation for understanding the design space of data-efficient RL for LLMs and to guide researchers working in this emerging area. We hope this survey offers a comprehensive roadmap for future research and inspires new directions toward more efficient and scalable reinforcement learning post-training for LLMs.
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Zelin Tan | Hejia Geng | Xiaohang Yu | Mulei Zhang | Guancheng Wan | Yifan Zhou | Qiang He | Xiangyuan Xue | Heng Zhou | Yutao Fan | Zhong-Zhi Li | Zaibin Zhang | Guibin Zhang | Chen Zhang | Zhenfei Yin | Philip Torr | Lei Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper investigates the scaling behavior of Large Language Model (LLM) reinforcement learning post-training, focusing on mathematical reasoning. Through experiments across the Qwen2.5 series (0.5B to 72B), we characterize how model scale, data, and compute interact. Our analysis yields four key findings: 1. Larger models consistently demonstrate superior compute and data efficiency. 2. The relationship between model performance and training resources follows a **predictive power-law** across both base and instruction-tuned models. 3. RL learning efficiency exhibits a latent **saturation trend** with increasing model scale. 4. In data-constrained regimes, performance is primarily driven by the **total volume of training data** rather than sample uniqueness. These results offer practical guidelines for scaling reasoning capabilities through reinforcement learning post-training.
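A power-law relationship of the kind this abstract refers to is commonly fitted by linear regression in log-log space. The sketch below assumes the simple form y = a · x^b and uses synthetic data; the paper's actual functional form, variables, and fitted constants are not given here, so `fit_power_law`, `compute`, and `test_loss` are purely illustrative.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log-log space is log(a)
    return a, b

# Synthetic data generated from a known power law (assumed values, not results).
compute = [1e1, 1e2, 1e3, 1e4, 1e5]
test_loss = [2.0 * c ** -0.3 for c in compute]

a, b = fit_power_law(compute, test_loss)
print(round(a, 3), round(b, 3))
```

With noise-free synthetic data the fit recovers the generating constants up to floating-point error; on real measurements the residuals in log space indicate how well a single power law describes the trend.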
2025
SURVEYFORGE: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
Xiangchao Yan | Shiyang Feng | Jiakang Yuan | Renqiu Xia | Bin Wang | Lei Bai | Bo Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SURVEYFORGE, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. Subsequently, leveraging high-quality papers retrieved from memory by our scholar navigation agent, SURVEYFORGE can automatically generate and refine the content of the generated article. Moreover, to achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison and assesses AI-generated survey papers across three dimensions: reference, outline, and content quality. Experiments demonstrate that SURVEYFORGE can outperform previous works such as AutoSurvey.
ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks
Heng Zhou | Hejia Geng | Xiangyuan Xue | Li Kang | Yiran Qin | Zhiyong Wang | Zhenfei Yin | Lei Bai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving. However, current MAS frameworks are limited by poor flexibility and scalability, with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process. The core of ReSo is the proposed Collaborative Reward Model, which can provide fine-grained reward signals for optimizing MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without human annotations. Experimentally, ReSo matches or outperforms existing methods. ReSo achieves 33.7% and 32.3% accuracy on Math-MAS and SciBench-MAS, while other methods completely fail. The code and data are available at [ReSo](https://github.com/hengzzzhou/ReSo).
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
Jiakang Yuan | Xiangchao Yan | Bo Zhang | Tao Chen | Botian Shi | Wanli Ouyang | Yu Qiao | Lei Bai | Bowen Zhou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. Dolphin first generates novel ideas based on feedback from previous experiments and relevant papers ranked by the topic and task attributes. Then, the generated ideas can be implemented using a code template refined and debugged with the designed exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and a subset of MLE-bench. Results show that Dolphin can continuously improve the performance of the input topic in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 3D point classification.
Co-authors
- Bo Zhang 7
- Shuyue Hu 4
- Wanli Ouyang 3
- Xiangchao Yan 3
- Peng Ye 3
- Jiakang Yuan 3
- Yiqun Zhang 3
- Tao Chen 2
- Shi Feng 2
- Shiyang Feng 2
- Hejia Geng 2
- Hao Li 2
- Xiao Luo 2
- Shengji Tang 2
- Zhonghai Wu 2
- Xiangyuan Xue 2
- Xiaocui Yang 2
- Zhenfei Yin 2
- Zhiyin Yu 2
- Wenlong Zhang 2
- Shufei Zhang 2
- Yuhao Zhou 2
- Heng Zhou 2
- Yatao Bian 1
- Jianjian Cao 1
- Zongsheng Cao 1
- Yang Chen 1
- Chunchun Chen 1
- Yi Duan 1
- Yue Fan 1
- Yutao Fan 1
- Ben Fei 1
- Jiangtao Feng 1
- Zhaoyan Guo 1
- Tieke He 1
- Qiang He 1
- Jiale Hong 1
- Qibin Hou 1
- Yusong Hu 1
- Li Kang 1
- Zhong-Zhi Li 1
- Weihao Lin 1
- Fenghua Ling 1
- Bo Liu 1
- Yunhui Liu 1
- Junyu Luo 1
- Wei-Ying Ma 1
- Runmin Ma 1
- Yuchen Mou 1
- Qizhi Pei 1
- Biqing Qi 1
- Yu Qiao 1
- Yiran Qin 1
- Botian Shi 1
- Yuntian Shi 1
- Jinxin Shi 1
- Hongru Sun 1
- Zelin Tan 1
- Philip Torr 1
- Guancheng Wan 1
- Bin Wang 1
- Zhiyong Wang 1
- Yifei Wang 1
- Lihao Wang 1
- Zihan Wang 1
- Daling Wang 1
- Chenxu Wang 1
- Zhen Wang 1
- Xing Wei 1
- Xiao-Ming Wu 1
- Lijun Wu 1
- Renqiu Xia 1
- Wanghan Xu 1
- Jun Xu 1
- Juncheng Yan 1
- Jie Yang 1
- Wei Ye 1
- Xiaohang Yu 1
- Xiaoyu Yue 1
- Shuaiyu Zhang 1
- Qiaosheng Zhang 1
- Yuxing Zhang 1
- Ming Zhang 1
- Mulei Zhang 1
- Zaibin Zhang 1
- Guibin Zhang 1
- Chen Zhang 1
- Xiangyu Zhao 1
- Guanjie Zheng 1
- Bowen Zhou 1
- Hao Zhou 1
- Yifan Zhou 1