Bingli Wu
2026
EvoRoute: Experience-Driven Self-Routing LLM Agent Systems
Guibin Zhang | Haiyang Yu | Kaiming Yang | Bingli Wu | Fei Huang | Yongbin Li | Shuicheng Yan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Guibin Zhang | Haiyang Yu | Kaiming Yang | Bingli Wu | Fei Huang | Yongbin Li | Shuicheng Yan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Complex agentic AI systems, powered by a coordinated ensemble of Large Language Models (LLMs), tool and memory modules, have demonstrated remarkable capabilities on intricate, multi-turn tasks. However, this success is shadowed by prohibitive economic costs and severe latency, exposing a critical, yet underexplored, trade-off. We formalize this challenge as the Agent System Trilemma: the inherent tension among achieving state-of-the-art performance, minimizing monetary cost, and ensuring rapid task completion. To dismantle this trilemma, we introduce EvoRoute, a self-evolving model routing paradigm that transcends static, pre-defined model assignments. Leveraging an ever-expanding knowledge base of prior experience, EvoRoute dynamically selects Pareto-optimal LLM backbones at each step, balancing accuracy, efficiency, and resource use, while continually refining its own selection policy through environment feedback. Experiments on challenging agentic benchmarks such as GAIA and BrowseComp+ demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.
2024
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Minzheng Wang | Longze Chen | Fu Cheng | Shengyi Liao | Xinghua Zhang | Bingli Wu | Haiyang Yu | Nan Xu | Lei Zhang | Run Luo | Yunshui Li | Min Yang | Fei Huang | Yongbin Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Minzheng Wang | Longze Chen | Fu Cheng | Shengyi Liao | Xinghua Zhang | Bingli Wu | Haiyang Yu | Nan Xu | Lei Zhang | Run Luo | Yunshui Li | Min Yang | Fei Huang | Yongbin Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Long-context modeling capabilities of Large Language Models (LLMs) have garnered widespread attention, leading to the emergence of LLMs with ultra-context windows. Meanwhile, benchmarks for evaluating long-context language models are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong’s test cases, each document is relevant to the final answer, ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model’s long-context modeling capabilities.