Zhaoning Zhang
2025
Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
Libo Zhang | Zhaoning Zhang | Xubaizhou | Rui Li | Zhiliang Tian | Songzhu Mei | Dongsheng Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
With the continuous advancement in the performance of large language models (LLMs), their demand for computational resources and memory has significantly increased, which poses major challenges for efficient inference on consumer-grade devices and legacy servers. These devices typically feature relatively weak GPUs and stronger CPUs. Although techniques such as parameter offloading and partial offloading can alleviate GPU memory pressure to some extent, their effectiveness is limited by communication latency and suboptimal hardware resource utilization. To address this issue, we propose Dovetail, a lossless inference acceleration method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding. Dovetail deploys a draft model on the GPU to perform preliminary predictions, while a target model running on the CPU validates these outputs. By reducing the granularity of data transfer, Dovetail significantly reduces communication overhead. To further improve efficiency, we optimize the draft model specifically for heterogeneous hardware environments by reducing the number of draft tokens to lower parallel verification latency, increasing model depth to enhance predictive capability, and introducing a Dynamic Gating Fusion (DGF) mechanism to better integrate feature and embedding information. We conduct comprehensive evaluations of Dovetail across various consumer-grade GPUs, covering multiple tasks and mainstream models. Experimental results on 13B models demonstrate that Dovetail achieves inference speedups ranging from 1.79× to 10.1× across different devices, while maintaining consistency and stability in the distribution of generated texts.
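The abstract describes a draft-on-GPU, verify-on-CPU speculative decoding loop in which only token-level data crosses the device boundary. The sketch below is a minimal, hypothetical illustration of such a loop in PyTorch, assuming Hugging Face-style causal LMs, greedy decoding, and batch size 1; the acceptance rule, the draft length k, and the model objects are placeholders and do not reproduce Dovetail's draft-model optimizations (fewer draft tokens, deeper draft model, DGF).

```python
# Minimal, hypothetical sketch of CPU/GPU heterogeneous speculative decoding.
# NOT the Dovetail implementation: models, acceptance rule, and draft length
# k are illustrative placeholders. Assumes HF-style models with `.logits`.
import torch

@torch.no_grad()
def heterogeneous_speculative_decode(draft_model, target_model, input_ids,
                                     k=3, max_new_tokens=64):
    """Draft model runs on the GPU, target model stays on the CPU.
    Only token ids (not hidden states or weights) cross the PCIe bus."""
    gpu, cpu = torch.device("cuda"), torch.device("cpu")
    draft_model.to(gpu).eval()
    target_model.to(cpu).eval()

    tokens = input_ids.to(cpu)  # running sequence kept on the CPU side
    while tokens.shape[-1] - input_ids.shape[-1] < max_new_tokens:
        # 1) GPU draft: propose k tokens autoregressively with the small model.
        draft = tokens.to(gpu)
        proposed = []
        for _ in range(k):
            logits = draft_model(draft).logits[:, -1, :]
            nxt = logits.argmax(dim=-1, keepdim=True)   # greedy for simplicity
            proposed.append(nxt)
            draft = torch.cat([draft, nxt], dim=-1)
        proposed = torch.cat(proposed, dim=-1).to(cpu)  # only k token ids move

        # 2) CPU verify: one parallel forward pass of the target model over the
        #    candidate continuation, then accept the longest agreeing prefix.
        candidate = torch.cat([tokens, proposed], dim=-1)
        tgt_logits = target_model(candidate).logits
        tgt_pred = tgt_logits[:, -k - 1:-1, :].argmax(dim=-1)
        agree = (tgt_pred == proposed).long().cumprod(dim=-1)
        n_accept = int(agree.sum())                      # batch size 1 assumed

        # 3) Keep accepted draft tokens plus one correction token from the target.
        correction = tgt_logits[:, -k - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, proposed[:, :n_accept], correction], dim=-1)
    return tokens
```

The point the sketch tries to capture is the communication pattern: in each round the GPU sends only k proposed token ids to the CPU, and the CPU contributes only the accepted prefix plus one correction token, rather than shipping weights or activations across the bus.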
Correlation-Aware Example Selection for In-Context Learning with Nonsymmetric Determinantal Point Processes
Qiunan Du | Zhiliang Tian | Zhen Huang | Kailun Bian | Tianlun Liu | Zhaoning Zhang | Xinwang Liu | Feng Liu | Dongsheng Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
LLMs with in-context learning (ICL) obtain remarkable performance but are sensitive to the quality of ICL examples. Prior work on ICL example selection has explored unsupervised heuristic methods and supervised LLM-based methods, but it typically focuses on selecting individual examples and ignores correlations among examples. Researchers have used the determinantal point process (DPP) to model negative correlations among examples and thus select diverse examples. However, the DPP cannot model positive correlations among examples, while ICL also relies on positive correlations to keep the selected examples consistent, which gives the LLM a clearer instruction. In this paper, we propose an ICL example selection method based on the nonsymmetric determinantal point process (NDPP) to capture both positive and negative correlations, considering both the diversity and the relevance of ICL examples. Specifically, we optimize the NDPP via kernel decomposition-based MLE to fit a constructed pseudo-labeled dataset, where we also propose a low-rank decomposition to reduce the computational cost. Further, we perform query-aware kernel adaptation on our NDPP to tailor the kernel to the input query, and we select examples via MAP inference based on the adapted NDPP. Experimental results show that our model outperforms strong baselines in ICL example selection.
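For intuition about the machinery the abstract names, the toy sketch below builds a low-rank nonsymmetric DPP kernel and selects ICL examples with greedy approximate MAP inference. The factor names (V, B, C), their dimensions, the random data, and the greedy routine are illustrative assumptions; they do not reproduce the paper's kernel decomposition-based MLE training, pseudo-labeled dataset, or query-aware kernel adaptation.

```python
# Toy NumPy sketch of greedy MAP selection under a low-rank nonsymmetric DPP
# kernel. All factors and data here are random placeholders, not the paper's
# learned kernel.
import numpy as np

def ndpp_kernel(V, B, C):
    """Low-rank NDPP kernel L = V V^T + B (C - C^T) B^T.
    The symmetric part (V V^T) behaves like a standard DPP that favors
    diverse sets; the skew-symmetric part adds directed (positive/negative)
    interactions between items."""
    return V @ V.T + B @ (C - C.T) @ B.T

def greedy_map_select(L, k):
    """Pick k example indices by greedily maximizing det(L_S) of the selected
    submatrix (a standard approximate MAP inference for DPP-style models)."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            score = np.linalg.det(L[np.ix_(idx, idx)])
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Usage with random low-rank factors for n candidate ICL examples.
rng = np.random.default_rng(0)
n, d = 50, 8                      # 50 candidate examples, rank-8 factors
V = rng.normal(size=(n, d))       # embeddings driving diversity/relevance
B = rng.normal(size=(n, d))       # embeddings for directed correlations
C = rng.normal(size=(d, d))       # inner matrix (learned in practice)
L = ndpp_kernel(V, B, C)
print(greedy_map_select(L, k=4))  # indices of 4 selected examples
```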