Guanghao Li
2026
Efficient Transformer Parameter Reuse via Zero-Token Mechanism
Guanghao Li | Wenhao Jiang | Li Shen | Ming Tang | Chun Yuan
Findings of the Association for Computational Linguistics: ACL 2026
Resource constraints often limit the parameter capacity of Large Language Models (LLMs), thereby hindering their performance. Although existing approaches leverage parameter sharing to reuse a fixed set of parameters within constrained budgets, they typically require each layer to fulfill multiple roles over a fixed number of iterations. This design compromises both efficiency and adaptability. In this work, we propose the **Zero Token Transformer (ZTT)**, which employs a head-tail decoupled parameter cycling strategy. Specifically, we decouple the first (head) and last (tail) layers from the parameter cycling process, enabling iterative refinement solely within the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, wherein a virtual token with a trainable key and a zero-valued vector functions as a standard token. The resulting attention scores not only reflect the computational significance of each layer but also facilitate dynamic early exiting, thereby preserving overall model accuracy. Our approach achieves superior performance under strict parameter constraints, substantially reduces computational overhead via early exits, and can be seamlessly integrated into the fine-tuning of existing pre-trained models, improving both efficiency and adaptability.
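The Zero-Token Mechanism described in the abstract can be illustrated with a minimal single-query attention sketch. This is not the authors' implementation. The function names, the scaled dot-product form, and in particular the exit rule (stop cycling once the virtual token absorbs at least `threshold` of the attention mass) are assumptions made for illustration; the abstract states only that the attention scores facilitate early exiting.

```python
import math


def zero_token_attention(query, keys, values, zero_key):
    """Single-query attention with one extra virtual ("zero") token.

    The virtual token has key `zero_key` (trainable in a real model) and an
    all-zero value, so it can absorb attention mass without adding anything
    to the output. Returns (output_vector, weight_on_zero_token).
    """
    scale = 1.0 / math.sqrt(len(query))
    all_keys = [zero_key] + list(keys)
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in all_keys]

    # Numerically stable softmax over the virtual token plus the real tokens.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    zero_weight, real_weights = weights[0], weights[1:]
    dim = len(values[0])
    # The zero token's value is the zero vector, so only real tokens contribute.
    output = [sum(w * v[i] for w, v in zip(real_weights, values)) for i in range(dim)]
    return output, zero_weight


def should_exit_early(zero_weight, threshold=0.9):
    """Hypothetical exit rule: stop once the zero token dominates attention."""
    return zero_weight >= threshold
```

Under this assumed rule, a large weight on the virtual token means the layer contributes little to the representation, which is one plausible way to read "computational significance" in the abstract.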
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding
Guanghao Li | Zhihui Fu | Min Fang | Qibin Zhao | Ming Tang | Chun Yuan | Jun Wang
Findings of the Association for Computational Linguistics: ACL 2026
Autoregressive (AR) decoding in large language models (LLMs) is latency-bounded by strictly sequential token generation. Speculative decoding mitigates this bottleneck by letting a fast drafter propose multi-token candidates that are then verified in parallel by the target model; yet most existing systems still rely on AR drafters, limiting wall-clock gains. We present **DiffuSpec**, which repurposes a *diffusion language model* (DLM) as a *parallel* drafter to generate multi-token proposals in a single forward pass while remaining compatible with standard AR verifiers. However, DLM drafting presents unique challenges: 1) bidirectional conditioning produces a token lattice where locally optimal tokens may fail to form a valid causal sequence; 2) the mechanism requires tuning the draft length, which induces a speed–quality trade-off. To address these issues, we introduce (i) *Causal-consistency Path Search* (CPS) to extract verifier-aligned causal paths from the lattice, and (ii) an *Adaptive Draft-Length* (ADL) controller that adjusts proposal lengths using online acceptance feedback. Across benchmarks, DiffuSpec achieves up to 3× wall-clock speedup and consistently outperforms strong baselines, demonstrating diffusion-based drafting as a competitive alternative to AR drafters for speculative decoding.
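The draft-then-verify loop and the idea of adjusting draft length from acceptance feedback can be sketched as follows. This is a toy illustration under stated assumptions, not DiffuSpec itself: `drafter` and `target_next` are hypothetical callables standing in for the models, verification is greedy exact-match, the doubling/shrinking rule in `DraftLengthController` is a placeholder rather than the paper's ADL controller, and CPS is not modeled.

```python
def verify_greedy(prefix, draft, target_next):
    """Accept the longest draft prefix matching the target's greedy choices.

    `target_next(tokens)` is a hypothetical callable returning the target
    model's greedy next token. A real verifier scores all draft positions in
    one parallel forward pass; this loop is the sequential equivalent.
    Returns (accepted_tokens, correction_token).
    """
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            return accepted, expected
        accepted.append(tok)
    # Whole draft accepted: the target still contributes one extra token.
    return accepted, target_next(prefix + accepted)


class DraftLengthController:
    """Placeholder feedback rule (an assumption, not the paper's ADL)."""

    def __init__(self, initial=4, minimum=1, maximum=16):
        self.length = initial
        self.minimum = minimum
        self.maximum = maximum

    def update(self, num_accepted, num_drafted):
        if num_accepted == num_drafted:
            # Every drafted token was accepted: try a longer draft next time.
            self.length = min(self.maximum, self.length * 2)
        else:
            # A rejection occurred: shrink toward the observed accepted length.
            self.length = max(self.minimum, num_accepted + 1)
        return self.length


def speculative_decode(prompt, drafter, target_next, num_new_tokens, controller):
    """Generate `num_new_tokens` tokens; output matches greedy AR decoding."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        draft = drafter(tokens, controller.length)
        accepted, correction = verify_greedy(tokens, draft, target_next)
        tokens.extend(accepted + [correction])
        controller.update(len(accepted), len(draft))
    return tokens[:goal]
```

With greedy verification, the output is identical to what the target model would produce on its own regardless of drafter quality; the drafter only affects how many target passes are needed, which is where the speed–quality trade-off in draft length arises.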