Weilun Zhao
2025
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao | Tengyu Pan | Xu Han | Yudi Zhang | Sun Ao | Yuxiang Huang | Kaihuo Zhang | Weilun Zhao | Yuxuan Li | Jie Zhou | Hao Zhou | Jianyong Wang | Maosong Sun | Zhiyuan Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
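The idea of constraining the draft search to a frequency-prioritized token subset can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); the function names, the toy sizes, and the choice of keeping one quarter of the vocabulary are assumptions made for illustration. The one-quarter ratio is chosen only because it matches the 75% LM Head reduction quoted in the abstract. The sketch covers the draft side only; verification by the full-vocabulary target model, which is what keeps the final output distribution unchanged, is not shown.

```python
import random

def build_frequency_ranked_subset(token_counts, subset_size):
    """Return the ids of the `subset_size` most frequent tokens, sorted by id."""
    # Rank token ids by descending corpus frequency (hypothetical counts).
    ranked = sorted(range(len(token_counts)), key=lambda t: -token_counts[t])
    return sorted(ranked[:subset_size])

def draft_logits_on_subset(hidden, lm_head_rows, subset_ids):
    """Compute draft logits only for tokens in the subset.

    Only len(subset_ids) dot products are evaluated instead of one per
    vocabulary entry, which is where the LM head saving comes from.
    """
    return [sum(w * h for w, h in zip(lm_head_rows[t], hidden)) for t in subset_ids]

def top_draft_tokens(hidden, lm_head_rows, subset_ids, n):
    """Pick the n highest-scoring draft candidates, as full-vocabulary ids."""
    logits = draft_logits_on_subset(hidden, lm_head_rows, subset_ids)
    order = sorted(range(len(logits)), key=lambda i: -logits[i])[:n]
    # Map positions inside the subset back to full-vocabulary token ids,
    # so the target model can verify them over its full vocabulary.
    return [subset_ids[i] for i in order]

# Toy setup: all sizes and values below are illustrative assumptions.
rng = random.Random(0)
V, d, k = 1000, 8, 250  # vocabulary size, hidden size, subset size
token_counts = [rng.randint(0, 10_000) for _ in range(V)]
lm_head_rows = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(V)]
hidden = [rng.gauss(0.0, 1.0) for _ in range(d)]

subset_ids = build_frequency_ranked_subset(token_counts, k)
candidates = top_draft_tokens(hidden, lm_head_rows, subset_ids, 5)
lm_head_fraction_saved = 1 - k / V  # fraction of LM head rows skipped
```

In this toy configuration the draft step evaluates 250 of the 1000 LM head rows, so the per-step LM head cost scales with the subset size rather than the full vocabulary size.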