Min Fang

2026

Autoregressive (AR) decoding in large language models (LLMs) is latency-bounded by strictly sequential token generation. Speculative decoding mitigates this bottleneck by letting a fast drafter propose multi-token candidates that are then verified in parallel by the target model; yet most existing systems still rely on AR drafters, limiting wall-clock gains. We present **DiffuSpec**, which repurposes a *diffusion language model* (DLM) as a *parallel* drafter to generate multi-token proposals in a single forward pass while remaining compatible with standard AR verifiers. However, DLM drafting presents unique challenges: 1) bidirectional conditioning produces a token lattice where locally optimal tokens may fail to form a valid causal sequence; 2) the mechanism requires tuning the draft length, which induces a speed–quality trade-off. To address these issues, we introduce (i) *Causal-consistency Path Search* (CPS) to extract verifier-aligned causal paths from the lattice, and (ii) an *Adaptive Draft-Length* (ADL) controller that adjusts proposal lengths using online acceptance feedback. Across benchmarks, DiffuSpec achieves up to 3× wall-clock speedup and consistently outperforms strong baselines, demonstrating diffusion-based drafting as a competitive alternative to AR drafters for speculative decoding.
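The abstract does not spell out how the ADL controller works, so the following is only a minimal sketch of the general idea of adjusting draft length from acceptance feedback, not the authors' method. It assumes greedy verification (a draft token is accepted only if it matches the verifier's own choice at that position) and a hypothetical update rule based on an exponential moving average of accepted lengths. The names `accepted_prefix_len` and `DraftLengthController`, and the parameters `k_min`, `k_max`, `alpha`, and `margin`, are all illustrative assumptions.

```python
import math
from typing import Sequence


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the verifier's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch ends the accepted prefix
        n += 1
    return n


class DraftLengthController:
    """Hypothetical controller (an assumption, not the paper's ADL rule).

    Tracks an exponential moving average (EMA) of how many draft tokens
    were accepted and proposes the next draft length as that average plus
    a small exploration margin, clamped to [k_min, k_max].
    """

    def __init__(self, k_min: int = 2, k_max: int = 32,
                 alpha: float = 0.5, margin: int = 2) -> None:
        self.k_min = k_min
        self.k_max = k_max
        self.alpha = alpha      # weight given to the newest observation
        self.margin = margin    # extra tokens drafted beyond the average
        self.ema = float(k_min)

    def update(self, accepted: int) -> int:
        """Record one verification result and return the next draft length."""
        self.ema = self.alpha * accepted + (1.0 - self.alpha) * self.ema
        k = math.ceil(self.ema) + self.margin
        return max(self.k_min, min(self.k_max, k))
```

In a decoding loop, one would call `accepted_prefix_len` on each drafted block after verification and feed the result to `update` to pick the next block size. Long accepted prefixes push the draft length up, while early rejections shrink it, which is one simple way to navigate the speed–quality trade-off the abstract mentions.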

Co-authors

Qibin Zhao (1)

Venues

Findings (1)
