XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Liu Guoming, Hai Zhao
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy. The source code is available at https://github.com/brinenick511/XQuant.
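The abstract names the ingredients (data-free calibration, cross-layer KV cache compression) without spelling them out. For orientation, here is a minimal sketch of the group-wise asymmetric KV cache quantization that the cited baselines (e.g., KIVI-2bit) build on; the function names, group size, and toy cache shape are illustrative assumptions, not XQuant's actual implementation.

```python
# Minimal sketch: group-wise asymmetric low-bit quantization of a KV cache
# tensor. This is a generic baseline in the style of KIVI-2bit, NOT XQuant's
# method; all names and parameters here are illustrative assumptions.
import torch

def quantize_asym(x: torch.Tensor, n_bits: int = 2, group_size: int = 32):
    """Asymmetric quantization over contiguous groups of `group_size` values.
    Requires x.numel() to be divisible by group_size."""
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)
    x_min = groups.min(dim=-1, keepdim=True).values
    x_max = groups.max(dim=-1, keepdim=True).values
    # One scale and zero-point (x_min) per group; clamp avoids divide-by-zero.
    scale = (x_max - x_min).clamp(min=1e-8) / (2 ** n_bits - 1)
    codes = torch.round((groups - x_min) / scale).clamp(0, 2 ** n_bits - 1)
    return codes.to(torch.uint8).reshape(orig_shape), scale, x_min

def dequantize_asym(codes: torch.Tensor, scale, x_min, group_size: int = 32):
    """Reconstruct an approximation of the original tensor from codes."""
    groups = codes.reshape(-1, group_size).float() * scale + x_min
    return groups.reshape(codes.shape)

if __name__ == "__main__":
    torch.manual_seed(0)
    kv = torch.randn(4, 128, 64)  # toy (heads, tokens, head_dim) cache slice
    codes, scale, zero = quantize_asym(kv, n_bits=2)
    kv_hat = dequantize_asym(codes, scale, zero)
    print("mean |error|:", (kv - kv_hat).abs().mean().item())
```

The "equivalent bit-width" accounting makes the paper's claim concrete: 2-bit codes plus an fp16 scale and zero-point per 32-element group cost 2 + (16 + 16)/32 = 3 bits per element, so reaching sub-1.4 bits presumably requires lowering the nominal bit-width and/or amortizing quantization metadata, for instance by sharing state across layers as the cross-layer compression in the title suggests.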
- Anthology ID: 2025.emnlp-main.494
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 9796–9811
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.494/
- Cite (ACL): Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Liu Guoming, and Hai Zhao. 2025. XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9796–9811, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression (Yang et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.494.pdf