@inproceedings{zhao-etal-2025-qspec,
title = "{QS}pec: Speculative Decoding with Complementary Quantization Schemes",
author = "Zhao, Juntao and
Lu, Wenhao and
Wang, Sheng and
Kong, Lingpeng and
Wu, Chuan",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.240/",
pages = "4779--4795",
ISBN = "979-8-89176-332-6",
abstract = "Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation-weight joint quantization enables efficient low-precision decoding, it suffers substantial performance degradation on multi-step reasoning tasks. We propose QSPEC, a novel quantization paradigm that decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low-precision joint quantization for fast drafting and high-precision weight-only quantization for accurate verification. QSPEC reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models. Compared to high-precision baselines, QSPEC achieves up to 1.64x speedup without quality degradation, and outperforms state-of-the-art speculative decoding methods by up to 1.55x in batched settings. Furthermore, QSPEC supports plug-and-play deployment and generalizes well across model scales, quantization methods, and workloads. These properties make QSPEC a practical and scalable solution for high-fidelity quantized LLM serving under memory-constrained scenarios."
}

Markdown (Informal)
[QSpec: Speculative Decoding with Complementary Quantization Schemes](https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.240/) (Zhao et al., EMNLP 2025)
ACL
Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, and Chuan Wu. 2025. [QSpec: Speculative Decoding with Complementary Quantization Schemes](https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.240/). In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 4779–4795, Suzhou, China. Association for Computational Linguistics.
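
For orientation only, below is a minimal, hypothetical Python sketch of the speculative-decoding control flow the abstract describes: a fast low-precision pass drafts a few tokens, a single high-precision pass verifies them, and the longest agreeing prefix is accepted. This is not the authors' implementation; the callables `draft_next_token` and `verify_greedy_tokens` are stand-ins for the two quantized execution paths, and the weight/KV-cache sharing that makes QSPEC's scheme switching near-zero-cost is only noted in comments, not modeled.

```python
# Illustrative sketch (not the paper's code) of greedy speculative decoding
# with two complementary quantization schemes, as described in the abstract.
from typing import Callable, List


def qspec_style_generate(
    prompt: List[int],
    draft_next_token: Callable[[List[int]], int],            # low-precision joint-quantized decode (fast)
    verify_greedy_tokens: Callable[[List[int], int], List[int]],  # high-precision weight-only pass (accurate)
    draft_len: int = 4,
    max_new_tokens: int = 32,
) -> List[int]:
    """Draft `draft_len` tokens cheaply, then verify them in one accurate pass.

    Accepted tokens are the longest prefix on which the two schemes agree;
    the first disagreement is replaced by the verifier's token, as in
    standard greedy speculative decoding.
    """
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1) Drafting: the low-precision scheme proposes a short continuation.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next_token(tokens + draft))

        # 2) Verification: one high-precision pass scores all drafted positions.
        #    (In QSPEC both schemes reuse the same weights and KV cache, so the
        #    switch is near-free; this sketch only mimics the control flow.)
        reference = verify_greedy_tokens(tokens, len(draft))

        # 3) Accept the agreeing prefix, then take one corrected token if needed.
        n_accept = 0
        while n_accept < len(draft) and draft[n_accept] == reference[n_accept]:
            n_accept += 1
        accepted = draft[:n_accept]
        if n_accept < len(reference):
            accepted.append(reference[n_accept])  # verifier's correction

        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[: len(prompt) + max_new_tokens]
```

The design point the abstract emphasizes is that, unlike draft-model speculative decoding, both stages here are the same model under different quantization schemes, so no auxiliary model, retraining, or separate KV cache is needed.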