Yuhan Wu
2026
Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution
Yuhua Zhou | Shichao Weng | Changhai Zhou | Yuhan Wu | Qian Qiao | Jun Gao | Fei Yang | Aimin Pan
Findings of the Association for Computational Linguistics: ACL 2026
While the massive scale of modern LLMs enables remarkable performance, their static, input-agnostic computational graph incurs substantial resource wastage and high latency during inference. Existing dynamic schemes, such as early-exit and layer-drop, reduce FLOPs but break batch processing or introduce KV-cache inconsistency. We propose Deputy, a dynamic low-rank substitution framework that employs a lightweight decision module at each layer to select the execution branch for each token: attention layers choose between full and low-rank computation to mitigate the KV-cache issue, while FFN layers additionally support skipping to further reduce computation. We fine-tune the LLM with LoRA and then derive an additional low-rank matrix C via a least-squares fit BC ≈ W_pre, where B is the shared LoRA matrix, so that only one extra low-rank matrix is introduced, effectively reducing memory overhead. Moreover, a hybrid KV cache strategy stores KV values generated by the low-rank branch, achieving a 38% reduction in cache storage. Experiments on Llama models demonstrate that Deputy reduces computation by approximately 40% compared to the original dense model while outperforming existing baseline methods.
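The least-squares step in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `fit_low_rank_factor`, the shape convention (B of size d_out × r, C of size r × d_in) and the random stand-in matrices are assumptions made here for illustration. Given a fixed B, the fit picks the C that minimizes the Frobenius norm of BC − W_pre.

```python
import numpy as np

def fit_low_rank_factor(B, W_pre):
    """Return C minimizing ||B @ C - W_pre||_F for a fixed B.

    Assumed shapes (illustrative, not taken from the paper):
      B:     (d_out, r)    shared LoRA matrix
      W_pre: (d_out, d_in) weight matrix to approximate
      C:     (r, d_in)     fitted low-rank factor
    """
    # lstsq solves one least-squares problem per column of W_pre.
    C, *_ = np.linalg.lstsq(B, W_pre, rcond=None)
    return C

# Random stand-ins for real weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8
B = rng.standard_normal((d_out, r))
W_pre = rng.standard_normal((d_out, d_in))

C = fit_low_rank_factor(B, W_pre)

# Relative approximation error of the rank-r substitute B @ C.
rel_err = np.linalg.norm(B @ C - W_pre) / np.linalg.norm(W_pre)
```

Under these assumptions only C (r × d_in values) is stored in addition to the LoRA matrices, which is the source of the memory saving the abstract describes. On random data the error is large; how well BC approximates a real weight matrix depends on the trained B and is not something this sketch can show.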