Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution
Yuhua Zhou, Shichao Weng, Changhai Zhou, Yuhan Wu, Qian Qiao, Jun Gao, Fei Yang, Aimin Pan
Abstract
While the massive scale of modern LLMs enables remarkable performance, their static, input-agnostic computational graph incurs substantial resource wastage and high latency during inference. Existing dynamic schemes, such as early-exit and layer-drop, reduce FLOPs but break batch processing or introduce KV-cache inconsistency. We propose Deputy, a dynamic low-rank substitution framework that employs a lightweight decision module at each layer to dynamically determine the execution branch for different tokens: Attention layers choose between full and low-rank computation to mitigate the KV cache issue, while FFN layers additionally support skipping to further reduce computation. We fine-tune the LLM with LoRA and then derive an additional low-rank matrix C via a least-squares fit BC ≈ Wpre, where B is the shared LoRA matrix, so that only one extra low-rank matrix is introduced, effectively reducing memory overhead. Moreover, a hybrid KV cache strategy stores KV values generated by the low-rank branch, achieving a 38% reduction in cache storage. Experiments on Llama models demonstrate that Deputy reduces computation by approximately 40% compared to the original dense model while outperforming existing baseline methods.
- Anthology ID:
- 2026.findings-acl.991
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19791–19810
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.991/
- Cite (ACL):
- Yuhua Zhou, Shichao Weng, Changhai Zhou, Yuhan Wu, Qian Qiao, Jun Gao, Fei Yang, and Aimin Pan. 2026. Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19791–19810, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Deputy: Accelerating Large Language Model Inference with Dynamic Low-Rank Substitution (Zhou et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.991.pdf
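
The least-squares fit BC ≈ Wpre mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the matrix orientation (B of shape d_out × r, C of shape r × d_in), the sizes, the random stand-in matrices, and the helper name `fit_low_rank_factor` are all assumptions made for illustration. With B held fixed, the C that minimizes the Frobenius norm of BC − Wpre is an ordinary least-squares solution.

```python
import numpy as np

def fit_low_rank_factor(B, W_pre):
    """Return C minimizing ||B @ C - W_pre||_F with B held fixed.

    Hypothetical helper: shapes assumed to be B (d_out, r), W_pre (d_out, d_in).
    """
    # lstsq solves each column of W_pre independently against B.
    C, _, _, _ = np.linalg.lstsq(B, W_pre, rcond=None)
    return C

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8                   # hypothetical sizes
B = rng.standard_normal((d_out, r))          # stand-in for a trained LoRA matrix B
W_pre = rng.standard_normal((d_out, d_in))   # stand-in for a pretrained weight

C = fit_low_rank_factor(B, W_pre)
residual = W_pre - B @ C
rel_err = np.linalg.norm(residual) / np.linalg.norm(W_pre)
print(f"C shape: {C.shape}, relative error: {rel_err:.3f}")
```

Because the stand-in Wpre here is random and far from low rank, the printed error says nothing about the approximation quality reported in the paper. The sketch only shows that reusing B means a single extra r × d_in matrix has to be stored per weight.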