Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Mengqi Liao; Lu Wang; Chaoyun Zhang; Bo Qiao; Si Qin; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang; Huaiyu Wan

Abstract

With reasoning becoming the generative paradigm for large language models, the memory bottleneck caused by the KV cache during inference has become a critical factor limiting high-concurrency serving. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of Full KV inference engines while delivering a speedup of over 2.1×.
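The general idea of pairing token-wise eviction with a block-paged cache can be illustrated with a toy sketch. This is not the paper's algorithm: the block size, the importance scores, and the function names `compress_blocks` and `blocks_needed` are all hypothetical, and real engines operate on GPU tensors rather than Python lists. The sketch only shows how keeping the highest-scoring tokens and repacking them into fixed-size blocks can free whole blocks for other requests.

```python
from math import ceil

BLOCK_SIZE = 4  # tokens per cache block (hypothetical value)


def blocks_needed(n_tokens):
    """Number of fixed-size blocks required to hold n_tokens."""
    return ceil(n_tokens / BLOCK_SIZE)


def compress_blocks(token_ids, scores, budget):
    """Keep the `budget` highest-scoring tokens in their original order
    and repack them into fixed-size blocks (illustrative only)."""
    if budget >= len(token_ids):
        kept = list(token_ids)
    else:
        # Pick the indices of the top-`budget` scores, then restore
        # positional order so the kept tokens stay in sequence.
        top = sorted(range(len(token_ids)),
                     key=lambda i: scores[i], reverse=True)[:budget]
        kept = [token_ids[i] for i in sorted(top)]
    return [kept[i:i + BLOCK_SIZE] for i in range(0, len(kept), BLOCK_SIZE)]


# Ten cached tokens with made-up importance scores.
tokens = list(range(10))
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05]

blocks = compress_blocks(tokens, scores, budget=6)
freed = blocks_needed(len(tokens)) - len(blocks)
print(blocks, "blocks freed:", freed)
```

In this toy setting, ten tokens occupy three blocks before compression and two afterwards, so one block returns to the pool. How scores are computed, when compression is triggered, and how it interacts with prefix caching are design questions the abstract attributes to the paper's scheduling strategy and are not modeled here.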

Anthology ID: 2026.findings-acl.381
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7716–7737
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.381/
Cite (ACL): Mengqi Liao, Lu Wang, Chaoyun Zhang, Bo Qiao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Huaiyu Wan. 2026. Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7716–7737, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention (Liao et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.381.pdf
Checklist: 2026.findings-acl.381.checklist.pdf
