Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs
Runyu Peng, Yunhua Zhou, Kai Lv, Yang Gao, Qipeng Guo, Xipeng Qiu
Abstract
The rapid advancement of Large Language Models (LLMs) has significantly enhanced performance across various natural language processing (NLP) tasks, yet the high computational costs and latency associated with deploying such models continue to pose critical bottlenecks, limiting their broader applicability. To mitigate these challenges, we propose a dynamic hybrid inference framework, Firewall Routing, which efficiently selects between a strong and a weak LLM based on the complexity of the query. A lightweight routing model is trained to optimize resource allocation by learning from response quality and by preventing long-tail queries, which are often too hard for LLMs to solve, from being routed to the stronger model. Moreover, our method incorporates multiple sampling to make query evaluation more reliable, and it leverages Hard Blocking and Soft Blocking to handle long-tail queries and to refine the labels used for model selection. Extensive experiments show that our method outperforms existing routing strategies by up to 5.29% in APGR, demonstrating state-of-the-art performance across multiple benchmarks.
- Anthology ID:
- 2025.emnlp-main.331
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6540–6565
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.331/
- Cite (ACL):
- Runyu Peng, Yunhua Zhou, Kai Lv, Yang Gao, Qipeng Guo, and Xipeng Qiu. 2025. Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6540–6565, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs (Peng et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.331.pdf