Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference

Libo Zhang, Zhaoning Zhang, Xubaizhou, Rui Li, Zhiliang Tian, Songzhu Mei, Dongsheng Li


Abstract
With the continuous advancement in the performance of large language models (LLMs), their demand for computational resources and memory has grown significantly, posing major challenges for efficient inference on consumer-grade devices and legacy servers. These devices typically pair relatively weak GPUs with comparatively strong CPUs. Although techniques such as parameter offloading and partial offloading can alleviate GPU memory pressure to some extent, their effectiveness is limited by communication latency and suboptimal hardware utilization. To address this issue, we propose Dovetail, a lossless inference acceleration method that leverages the complementary characteristics of heterogeneous devices together with the advantages of speculative decoding. Dovetail deploys a draft model on the GPU to perform preliminary predictions, while the target model running on the CPU validates these outputs. By reducing the granularity of data transfer, Dovetail substantially reduces communication overhead. To further improve efficiency, we optimize the draft model for heterogeneous hardware environments: we reduce the number of draft tokens to lower parallel verification latency, increase model depth to enhance predictive capability, and introduce a Dynamic Gating Fusion (DGF) mechanism to better integrate feature and embedding information. We conduct comprehensive evaluations of Dovetail across various consumer-grade GPUs, covering multiple tasks and mainstream models. Experimental results on 13B models demonstrate that Dovetail achieves inference speedups ranging from 1.79× to 10.1× across different devices, while preserving the distribution of the generated text.
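The abstract compresses the decoding loop into prose; the sketch below makes the division of labor concrete. It is a minimal, self-contained PyTorch illustration of the scheme the abstract describes, with a small draft model proposing tokens on the GPU and a large target model verifying them in one parallel forward pass on the CPU. The TinyLM toy models, the draft length k, and the greedy longest-agreeing-prefix accept rule are illustrative assumptions, not the paper's implementation; the paper's actual draft architecture (fewer draft tokens, deeper layers, DGF) is not reproduced here.

```python
# Minimal sketch of CPU/GPU heterogeneous speculative decoding, under the
# assumptions stated above. Not the authors' code: model classes, sizes,
# and the greedy accept rule are illustrative only.
import torch
import torch.nn as nn

VOCAB = 32000

class TinyLM(nn.Module):
    """Toy stand-in for a language model: embedding -> linear head."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, ids):                  # ids: (seq,)
        return self.head(self.embed(ids))    # logits: (seq, VOCAB)

@torch.no_grad()
def speculative_step(draft, target, ids, k=3):
    """Draft k tokens on the GPU, then verify them on the CPU.

    Only the k proposed token ids (a handful of integers) cross the
    device boundary, which is the reduced transfer granularity the
    abstract refers to.
    """
    device = next(draft.parameters()).device
    ctx = ids.to(device)
    proposals = []
    for _ in range(k):                                   # autoregressive draft
        nxt = draft(ctx)[-1].argmax()
        proposals.append(nxt)
        ctx = torch.cat([ctx, nxt.view(1)])
    cand = torch.stack(proposals).cpu()                  # tiny transfer: k ids

    # Target checks all k candidates in a single parallel forward pass.
    full = torch.cat([ids, cand])
    tgt_next = target(full)[len(ids) - 1:-1].argmax(-1)  # target's greedy picks
    accept = int((tgt_next == cand).long().cumprod(0).sum())  # agreeing prefix
    # Keep the accepted prefix, plus the target's corrected token at the
    # first mismatch (empty slice if all k candidates agree) -> lossless.
    return torch.cat([ids, cand[:accept], tgt_next[accept:accept + 1]])

if __name__ == "__main__":
    gpu = "cuda" if torch.cuda.is_available() else "cpu"
    draft = TinyLM(64).to(gpu)   # small and fast: lives on the GPU
    target = TinyLM(512)         # large: stays in host RAM, runs on the CPU
    ids = torch.randint(VOCAB, (8,))
    ids = speculative_step(draft, target, ids, k=3)
    print(ids.shape)
```

Because the target model only confirms or corrects the draft's proposals, each CPU forward pass can validate several tokens at once, which is where the gain over token-by-token CPU decoding comes from.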
Anthology ID:
2025.emnlp-main.879
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
17393–17406
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.879/
Cite (ACL):
Libo Zhang, Zhaoning Zhang, Xubaizhou, Rui Li, Zhiliang Tian, Songzhu Mei, and Dongsheng Li. 2025. Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17393–17406, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.879.pdf
Checklist:
2025.emnlp-main.879.checklist.pdf