Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
Mengqi Liao, Lu Wang, Chaoyun Zhang, Bo Qiao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan
Abstract
With reasoning becoming the generative paradigm for large language models, the memory bottleneck caused by KV cache during the inference phase has become a critical factor limiting high-concurrency service capabilities. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of Full KV inference engines while delivering over 2.1× speedup.
- Anthology ID:
- 2026.findings-acl.381
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7716–7737
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.381/
- Cite (ACL):
- Mengqi Liao, Lu Wang, Chaoyun Zhang, Bo Qiao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Huaiyu Wan. 2026. Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7716–7737, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention (Liao et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.381.pdf