Shuyue Hu
2026
LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
Hao Li | Yiqun Zhang | Zhaoyan Guo | Chenxu Wang | Shengji Tang | Qiaosheng Zhang | Yang Chen | Biqing Qi | Peng Ye | Lei Bai | Zhen Wang | Shuyue Hu
Findings of the Association for Computational Linguistics: ACL 2026
Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity—the central premise of LLM routing—we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.
Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation
Zhiyao Cui | Chenxu Wang | Shuyue Hu | Yiqun Zhang | Wenqi Shao | Qiaosheng Zhang | Zhen Wang
Findings of the Association for Computational Linguistics: ACL 2026
Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper: (1) proposes DeepSlides, a hierarchical slide-generation workflow that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slide-generation tasks; (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on the evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at: https://anonymous.4open.science/r/DeepSlides-D14D
A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement
Shengji Tang | Jianjian Cao | Weihao Lin | Jiale Hong | Bo Zhang | Shuyue Hu | Lei Bai | Tao Chen | Wanli Ouyang | Peng Ye
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration–Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%), across multiple tasks. Remarkably, it even exceeds the average of best results on different datasets with open-source LLMs (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.
MTRouter: Cost-Aware Multi-Turn LLM Routing with History–Model Joint Embeddings
Yiqun Zhang | Hao Li | Zihan Wang | Shi Feng | Xiaocui Yang | Daling Wang | Bo Zhang | Lei Bai | Shuyue Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history–model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance–cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity’s Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
Nature-Inspired Population-Based Evolution of Large Language Models
Yiqun Zhang | Peng Ye | Xiaocui Yang | Shi Feng | Shufei Zhang | Lei Bai | Wanli Ouyang | Shuyue Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evolution, the engine behind the survival and growth of life on Earth, operates through the population-based process of reproduction. Inspired by this principle, this paper formally defines a newly emerging problem: the population-based evolution of large language models (LLMs). We introduce a novel framework that starts with a population of parent LLMs and allows this population to evolve through four key operations: (i) crossover, merging the weights of different parents to create offspring LLMs, (ii) mutation, introducing small, random changes to model weights to foster diversity, (iii) selection, prioritizing high-performing models, and (iv) succession, transferring the learned experience from parent to offspring LLMs. With only 200 samples per new task, the LLM population evolves rapidly to adapt to the task at hand, without any gradients. Experiments on 12 datasets show that our framework consistently outperforms existing multi-LLM merging and adaptation methods, achieving relative performance gains of up to 54.8 over the best LLM in the initial population. Moreover, our framework allows for (i) the evolution of LLMs across multiple new tasks simultaneously, (ii) scaling effectively with populations of up to 40 LLMs, and (iii) even zero-shot generalization to unseen held-out tasks. Code: https://github.com/ZhangYiqun018/GENOME
2025
Reinforcement Learning for Large Language Models via Group Preference Reward Shaping
Huaisheng Zhu | Siyuan Xu | Hangfan Zhang | Teng Xiao | Zhimeng Guo | Shijie Zhou | Shuyue Hu | Vasant G. Honavar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) require alignment via reinforcement learning (RL) to meet task-specific objectives effectively, such as human preference alignment and enhanced reasoning. While Proximal Policy Optimization (PPO) is widely adopted, its computational overhead, stemming from additional value model requirements, limits applicability. Existing alternatives, like Group Relative Policy Optimization (GRPO), mitigate computational costs but remain sensitive to reward model quality. To address this, we introduce Group Preference Reward Shaping (GPRS), a novel method that leverages preference-based comparisons rather than precise numerical rewards. GPRS requires no extra model components and remains robust across varying reward model sizes and qualities. Extensive experiments demonstrate that GPRS consistently outperforms existing critic-model-free RL algorithms in Reinforcement Learning from Human Feedback (RLHF) and reasoning tasks, delivering stable and strong alignment performance.
Co-authors
- Lei Bai 4
- Yiqun Zhang 4
- Peng Ye 3
- Shi Feng 2
- Hao Li 2
- Wanli Ouyang 2
- Shengji Tang 2
- Chenxu Wang 2
- Zhen Wang 2
- Xiaocui Yang 2
- Bo Zhang 2
- Qiaosheng Zhang 2
- Jianjian Cao 1
- Tao Chen 1
- Yang Chen 1
- Zhiyao Cui 1
- Zhaoyan Guo 1
- Zhimeng Guo 1
- Vasant G. Honavar 1
- Jiale Hong 1
- Weihao Lin 1
- Biqing Qi 1
- Wenqi Shao 1
- Daling Wang 1
- Zihan Wang 1
- Teng Xiao 1
- Siyuan Xu 1
- Hangfan Zhang 1
- Shufei Zhang 1
- Shijie Zhou 1
- Huaisheng Zhu 1