Cheng Longkai
Also published as: Longkai Cheng
2026
EDSD: Entropy-Driven Design for Faster Speculative Decoding
Longkai Cheng | Ximing Wang | Jiangcai Zhu | Kailai Shao | Chao Chen | Haixiang Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speculative decoding has emerged as a promising paradigm for accelerating large language model inference by leveraging a lightweight draft model to generate multiple candidate tokens. However, existing methods often incur substantial training overhead to mitigate information misalignment between autoregressive draft model training and decoding. To address this challenge, we propose EDSD, an Entropy-Driven Speculative Decoding framework that uses entropy as a unified, interpretable signal for both draft model training and architectural design. EDSD drives the draft model to progressively align with the target model in an easy-to-hard manner while establishing token-level alignment as a dominant design principle. Extensive experiments on seven LLMs demonstrate that EDSD improves training efficiency by 24.8%, increases the average acceptance length by 4.0%, and achieves a 4.1% speedup compared to state-of-the-art methods. Furthermore, EDSD improves robustness to system prompt variations by more than 5x. Our findings establish entropy-driven alignment as an effective and principled foundation for efficient speculative decoding.
2025
HookMoE: A learnable performance compensation strategy of Mixture-of-Experts for LLM inference acceleration
Cheng Longkai | Along He | Mulin Li | Xie Xueshuo | Tao Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Mixture of Experts (MoE) architectures have emerged as a promising paradigm for scaling model capacity through top-k routing mechanisms. Although reducing the number of activated experts inherently enables inference acceleration, this efficiency gain typically comes at the cost of significant performance degradation. To address this trade-off between efficiency and performance, we propose HookMoE, a plug-and-play single-layer compensation framework that effectively restores performance using only a small post-training calibration set. Our method strategically inserts a lightweight trainable Hook module immediately preceding selected transformer blocks. In comprehensive evaluations on four popular MoE models, our method reduces the number of activated experts by more than 50% and achieves a 1.42× inference speed-up during the prefill stage, with an average performance degradation of only 2.5% across various benchmarks. Through systematic analysis, we further reveal that the upper layers require fewer active experts, offering actionable insights for refining dynamic expert selection strategies and enhancing the overall efficiency of MoE models.