Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs
Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Abstract
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (**MHA2MLA**), which includes two key components: for *partial-RoPE*, we remove RoPE from the dimensions of queries and keys that contribute less to the attention scores; for *low-rank approximation*, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.6% to 1%) of the data, significantly reducing inference costs while integrating seamlessly with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 1% drop in LongBench performance. Our source code is publicly available at https://github.com/JT-Ushio/MHA2MLA.
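As a rough illustration of the low-rank approximation idea mentioned in the abstract, the sketch below shows one way a joint SVD over pre-trained key/value projections could yield a shared latent "down" projection plus separate "up" projections, so that only a small latent vector needs to be cached per token. This is a minimal sketch under assumed shapes and names (`joint_kv_svd`, per-head weight matrices of shape `(d_model, d_head)`, latent rank `r`); it is not the paper's actual implementation.

```python
import torch

def joint_kv_svd(W_k: torch.Tensor, W_v: torch.Tensor, r: int):
    """Jointly factor pre-trained key/value projections into a shared
    low-rank down-projection and two up-projections of rank r.

    W_k, W_v: per-head projection matrices of shape (d_model, d_head)
              (hypothetical layout for this sketch).
    Returns W_down (d_model, r), W_up_k (r, d_head), W_up_v (r, d_head)
    such that W_k ~ W_down @ W_up_k and W_v ~ W_down @ W_up_v.
    """
    # Concatenate along the output dimension so K and V share one latent space.
    W_kv = torch.cat([W_k, W_v], dim=1)               # (d_model, 2 * d_head)
    U, S, Vh = torch.linalg.svd(W_kv, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r, :]       # truncate to rank r
    W_down = U_r                                      # hidden state -> latent
    W_up = torch.diag(S_r) @ Vh_r                     # latent -> concatenated K|V
    d_head = W_k.shape[1]
    return W_down, W_up[:, :d_head], W_up[:, d_head:]

# Usage on random stand-in weights (d_model=4096, d_head=128, latent rank 32):
d_model, d_head, r = 4096, 128, 32
W_k, W_v = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
W_down, W_up_k, W_up_v = joint_kv_svd(W_k, W_v, r)
# At inference, only h @ W_down (r values per token) would be cached; keys and
# values are re-materialized via W_up_k / W_up_v, which is the source of the
# KV-cache savings the abstract describes.
print(W_down.shape, W_up_k.shape, W_up_v.shape)
```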
- Anthology ID: 2025.acl-long.1597
- Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 33313–33328
- URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1597/
- Cite (ACL): Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, and Tao Gui. 2025. Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33313–33328, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs (Ji et al., ACL 2025)
- PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1597.pdf