@inproceedings{jin-etal-2025-align,
    title = "Align Attention Heads Before Merging Them: An Effective Way for Converting {MHA} to {GQA}",
    author = "Jin, Qingyun and
      Song, Xiaohui and
      Zhou, Feng and
      Qin, Zengchang",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.467/",
    doi = "10.18653/v1/2025.findings-emnlp.467",
    pages = "8804--8816",
    isbn = "979-8-89176-335-7",
    abstract = "Large language models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, as the model size and the input sequence{'}s length increase, the linearly increasing key-value (KV) cache significantly degrades inference throughput. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely introduced into LLMs. In this work, we propose a cost-effective method for converting MHA into GQA with any compression ratio of KV heads. The key point of our method lies in the application of Procrustes analysis to the attention heads, which enhances the similarity among attention heads while preserving computational invariance, thereby improving the model{'}s post-training performance. Subsequently, we employ $\mathit{L_0}$ regularization to prune redundant parameters. The model after pruning can be adapted to the standard GQA framework. Experimental results show that our strategy can compress up to 87.5{\%} KV heads of LLaMA2-7B model and 75{\%} KV heads of Sheared-LLaMA-1.3B with acceptable performance degradation. Our code is released at https://github.com/fpcsong/mha2gqa."
}
Markdown (Informal)
[Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA](https://aclanthology.org/2025.findings-emnlp.467/) (Jin et al., Findings 2025)
ACL