Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
Hao Mark Chen | Wayne Luk | Yiu Ka Fai Cedric | Rui Li | Konstantin Mishchenko | Stylianos Venieris | Hongxiang Fan
Findings of the Association for Computational Linguistics: EMNLP 2025
The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has explored various speculative decoding techniques for multi-token generation, these methods introduce high memory costs from the additional weights and KV cache of separate draft models, limiting their efficiency in edge and long-context scenarios. To overcome these limitations in edge-scale LLMs, we propose parallel prompt decoding (PPD), a novel scheme that incurs negligible runtime memory overhead by employing a single unified model for both speculation and verification. Inspired by the human natural language generation process, PPD approximates outputs at future timesteps in parallel by using multiple trained prompt tokens. Furthermore, we present a hardware-aware two-stage tree pruning algorithm that adaptively optimizes this decoding scheme to fully leverage the computational capacity of different GPUs. Through extensive experiments across LLMs ranging from MobileLlama to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to a 2.49x speedup. Moreover, PPD serves as an orthogonal optimization that integrates synergistically with existing speculative decoding methods, yielding up to a further 1.22x speed improvement. To support future development, we have included our code implementation with this submission.
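A minimal, self-contained sketch of the core idea described in the abstract: a few trainable "prompt token" embeddings are appended to the current context so that a single forward pass produces draft predictions for several future positions, which the same model then verifies. The tiny causal LM, the random prompt embeddings, and all sizes below are illustrative assumptions, not the paper's implementation.

```python
# Toy illustration of parallel prompt decoding (PPD) with a stand-in model.
import torch
import torch.nn as nn

VOCAB, D_MODEL, K_PROMPT = 100, 32, 3  # toy vocabulary, hidden size, number of prompt tokens


class TinyCausalLM(nn.Module):
    """Stand-in causal language model (single transformer encoder layer)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, inputs_embeds):
        seq_len = inputs_embeds.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        hidden = self.layer(inputs_embeds, src_mask=mask)
        return self.head(hidden)  # (batch, seq_len, vocab) logits


model = TinyCausalLM().eval()
# In the paper the prompt-token embeddings are trained; random here purely for illustration.
prompt_embeds = nn.Parameter(torch.randn(K_PROMPT, D_MODEL))

context_ids = torch.randint(0, VOCAB, (1, 8))  # current prompt / generated prefix
ctx_len = context_ids.size(1)

with torch.no_grad():
    ctx_embeds = model.embed(context_ids)
    # Speculation: append K_PROMPT prompt tokens and run ONE forward pass.
    spec_in = torch.cat([ctx_embeds, prompt_embeds.unsqueeze(0)], dim=1)
    spec_logits = model(spec_in)
    # Logits at the last context position and at each prompt-token position
    # give draft tokens for the next 1 + K_PROMPT timesteps.
    draft = spec_logits[:, ctx_len - 1 :, :].argmax(-1)  # shape (1, K_PROMPT + 1)

    # Verification: feed context + draft tokens through the SAME model and keep
    # the longest prefix of drafts that matches the model's own greedy choices.
    verify_ids = torch.cat([context_ids, draft], dim=1)
    verify_logits = model(model.embed(verify_ids))
    greedy_next = verify_logits[:, ctx_len - 1 : -1, :].argmax(-1)
    accepted = 0
    for i in range(draft.size(1)):
        if draft[0, i] != greedy_next[0, i]:
            break
        accepted += 1
    print("draft tokens:", draft.tolist(), "accepted:", accepted)
```

Because speculation and verification share one set of weights and one KV cache, the only extra state is the handful of prompt-token embeddings, which is where the negligible memory overhead comes from.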