Jucheng Shen

2026

We present CadLLM, a training-free method that improves the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence score of the unmasked tokens. We further reduce the softmax overhead of token probability generation by dynamically using a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields throughput improvements of 1.1x to 2.28x over the state-of-the-art baseline with competitive accuracy.
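
As a rough illustration of the vocabulary-subset idea mentioned in the abstract, the sketch below computes a softmax over only the k largest logits and picks k from an average confidence score. It is not taken from the paper: the function names, the linear confidence-to-k rule, and the assumption that higher confidence should narrow the subset are all hypothetical choices made for this example.

```python
import math


def topk_softmax(logits, k):
    """Softmax restricted to the k largest logits; all other entries get probability 0."""
    k = max(1, min(k, len(logits)))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    probs = [0.0] * len(logits)
    for i, e in exps.items():
        probs[i] = e / z
    return probs


def choose_vocab_subset_size(mean_confidence, k_min, k_max):
    """Hypothetical rule (an assumption, not the paper's): higher confidence gives a narrower subset."""
    c = min(1.0, max(0.0, mean_confidence))
    return round(k_max - c * (k_max - k_min))
```

For example, `topk_softmax(logits, choose_vocab_subset_size(0.9, 16, 1024))` would normalize over a small subset of the vocabulary when the average confidence is high, so the exponentiation and normalization touch far fewer entries than a full-vocabulary softmax.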

Co-authors

Zhangyang Wang

Venues

Findings
