FLARE: Fine-Grained Length-Aware Routing for Resource-Efficient Heterogeneous LLM Serving

Yujia Fu; Heming Zhong; Dan Huang; Yutong Lu

Abstract

With the rapid proliferation of large language models (LLMs), model pools have become increasingly heterogeneous in both capability and efficiency. Larger LLMs can improve quality but incur higher latency and cost, while smaller LLMs are faster and cheaper but less capable, making per-query model selection crucial in practice. This has spawned LLM routers that dispatch each query to an appropriate model. Existing routers lack fine-grained resource awareness across deployment settings, which degrades efficiency metrics in real-world serving. To this end, we propose FLARE, a length-centric, resource-aware multi-LLM routing framework that uses length-based models to estimate per-query latency and cost. FLARE formulates routing as a discrete multi-objective optimization problem to achieve an efficient trade-off. Experiments show that FLARE reduces latency and cost by up to 68% and 75%, respectively, while maintaining competitive accuracy, and can be easily applied to new datasets and LLMs.
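The idea of length-based estimation feeding a discrete choice over a model pool can be illustrated with a small sketch. This is not FLARE's implementation: the abstract does not give the estimators or the optimizer, so the linear latency and cost models, the weighted-sum scalarization, the weights, and every name and number below (`ModelProfile`, `route`, the two-model `POOL`) are assumptions made purely for illustration. In a real system the output length would itself have to be predicted per query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model profile; all fields are illustrative assumptions."""
    name: str
    accuracy: float              # assumed predicted quality in [0, 1]
    prefill_s_per_token: float   # assumed seconds per input token
    decode_s_per_token: float    # assumed seconds per output token
    usd_per_input_token: float   # assumed price per input token
    usd_per_output_token: float  # assumed price per output token


def estimate_latency(m: ModelProfile, in_tokens: int, out_tokens: int) -> float:
    # Assumed linear length-based latency model: prefill time plus decode time.
    return m.prefill_s_per_token * in_tokens + m.decode_s_per_token * out_tokens


def estimate_cost(m: ModelProfile, in_tokens: int, out_tokens: int) -> float:
    # Assumed linear length-based cost model: token counts times per-token prices.
    return m.usd_per_input_token * in_tokens + m.usd_per_output_token * out_tokens


def route(pool, in_tokens, out_tokens, w_acc=1.0, w_lat=0.1, w_cost=10.0):
    """Pick one model from a discrete pool via weighted-sum scalarization.

    Weighted-sum scalarization is a stand-in chosen for brevity; it is one
    common way to collapse several objectives into a single score.
    """
    def score(m: ModelProfile) -> float:
        return (w_acc * m.accuracy
                - w_lat * estimate_latency(m, in_tokens, out_tokens)
                - w_cost * estimate_cost(m, in_tokens, out_tokens))
    return max(pool, key=score)


# Made-up two-model pool, used only to exercise the functions above.
POOL = [
    ModelProfile("small", 0.70, 0.0001, 0.01, 1e-7, 2e-7),
    ModelProfile("large", 0.90, 0.0005, 0.05, 1e-6, 3e-6),
]

if __name__ == "__main__":
    # A long expected output makes the large model's latency penalty dominate.
    print(route(POOL, in_tokens=100, out_tokens=200).name)
    # With latency and cost ignored, only accuracy matters.
    print(route(POOL, in_tokens=100, out_tokens=200, w_lat=0.0, w_cost=0.0).name)
```

Under these made-up numbers, raising the latency or cost weight pushes queries with long expected outputs toward the smaller model, while setting both weights to zero reduces the choice to accuracy alone.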

Anthology ID: 2026.acl-long.1018
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 22249–22266
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1018/
Cite (ACL): Yujia Fu, Heming Zhong, Dan Huang, and Yutong Lu. 2026. FLARE: Fine-Grained Length-Aware Routing for Resource-Efficient Heterogeneous LLM Serving. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22249–22266, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): FLARE: Fine-Grained Length-Aware Routing for Resource-Efficient Heterogeneous LLM Serving (Fu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1018.pdf
Checklist: 2026.acl-long.1018.checklist.pdf
