Kaihuo Zhang
2025
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao | Tengyu Pan | Xu Han | Yudi Zhang | Sun Ao | Yuxiang Huang | Kaihuo Zhang | Weilun Zhao | Yuxuan Li | Jie Zhou | Hao Zhou | Jianyong Wang | Maosong Sun | Zhiyuan Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
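The idea of restricting the draft's LM head to a frequency-ranked token subset can be illustrated with a minimal sketch. This is not the FR-Spec implementation: the function names (`build_frequent_subset`, `draft_logits_subset`, `draft_next_token`), the dictionary-of-rows weight layout, and the toy numbers are all assumptions made for illustration. It only shows that the draft computes logits for the kept rows, so the cost of this step scales with the subset size rather than the full vocabulary.

```python
from collections import Counter


def build_frequent_subset(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids, most frequent first.

    Ties are broken by token id so the result is deterministic.
    """
    counts = Counter(corpus_token_ids)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:keep]


def draft_logits_subset(hidden, lm_head_rows, subset):
    """Compute draft logits only for tokens in `subset`.

    `lm_head_rows[t]` is the (hypothetical) LM-head weight row of token t.
    Rows outside the subset are never touched.
    """
    return {
        t: sum(h * w for h, w in zip(hidden, lm_head_rows[t]))
        for t in subset
    }


def draft_next_token(hidden, lm_head_rows, subset):
    """Greedy draft proposal restricted to the frequent-token subset."""
    logits = draft_logits_subset(hidden, lm_head_rows, subset)
    return max(logits, key=logits.get)


if __name__ == "__main__":
    corpus = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
    subset = build_frequent_subset(corpus, keep=2)
    rows = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [5.0, 5.0], 3: [0.5, 0.5]}
    print(subset, draft_next_token([1.0, 0.2], rows, subset))
```

In this toy setup the rare token 2 would win under the full vocabulary, but the draft cannot propose it. In a draft-then-verify scheme that only lowers the acceptance rate of the draft, because the target model still verifies every proposed token against its full vocabulary.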
2024
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao | Yuxiang Huang | Xu Han | Wang Xu | Chaojun Xiao | Xinrong Zhang | Yewei Fang | Kaihuo Zhang | Zhiyuan Liu | Maosong Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Speculative decoding is a widely used method that accelerates the generation process of large language models (LLMs) with no compromise in model performance. It achieves this goal by using an existing smaller model for drafting and then employing the target LLM to verify the draft in a low-cost parallel manner. Under such a drafting-verification framework, drafting efficiency has become a bottleneck in the final speedup of speculative decoding. Therefore, generating longer drafts at less cost can lead to better decoding speedup. To achieve this, we introduce Ouroboros, which can generate draft phrases to parallelize the drafting process and meanwhile lengthen drafts in a training-free manner. The experimental results on various typical text generation tasks show that Ouroboros can achieve speedups of up to 2.4× over speculative decoding and 3.9× over vanilla decoding, without fine-tuning draft and target models. Code is available at https://github.com/thunlp/Ouroboros.
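The drafting-verification framework described above can be sketched for the greedy case as follows. This is plain speculative decoding, not the phrase-based drafting of Ouroboros; `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and both models are stand-in callables that map a token sequence to the next token. A real system would verify all draft positions in one parallel forward pass of the target model; the loop here only mimics the acceptance logic.

```python
def speculative_step(prefix, draft_next, target_next, draft_len):
    """Run one draft-then-verify step and return the newly accepted tokens.

    Greedy variant: a draft token is accepted only if it equals the token
    the target model would have produced at that position, so the output
    matches plain greedy decoding with the target model.
    """
    # Draft phase: the small model proposes `draft_len` tokens one by one.
    draft = []
    for _ in range(draft_len):
        draft.append(draft_next(prefix + draft))

    # Verify phase: compare each draft token with the target's choice.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


if __name__ == "__main__":
    def target(seq):
        return (seq[-1] + 1) % 10

    def draft(seq):
        # Agrees with the target except right after token 3.
        return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

    print(speculative_step([1], draft, target, draft_len=4))
```

Each step yields at least one token from the target model, and longer accepted drafts mean fewer target passes per generated token, which is why the abstract treats drafting efficiency as the bottleneck.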
Co-authors
- Xu Han 2
- Yuxiang Huang 2
- Zhiyuan Liu 2
- Maosong Sun (孙茂松) 2
- Weilin Zhao 2