RoBSA: RoPE-based Blockwise Sparse Multi-head Latent Attention

Xinyu Shi; Kairong Luo; Zhen Zheng; Wenguang Chen

Abstract

Large Language Models (LLMs) have advanced rapidly in recent years, scaling up in both parameter count and context length. However, as context windows extend from thousands to hundreds of thousands of tokens, attention computation becomes the dominant source of memory usage and runtime in the decoding stage, severely limiting the efficiency and scalability of long-context LLMs. Sparse attention has emerged as a promising solution, reducing complexity by computing attention over only a subset of context tokens. However, sparse attention for Multi-head Latent Attention (MLA), a variant of standard multi-head attention (MHA), has rarely been studied. In this paper, we introduce RoPE-based Blockwise Sparse Attention (RoBSA), a method designed specifically for MLA during the decoding stage of model inference. RoBSA leverages the decoupled nature of RoPE within MLA to implement token selection in a blockwise manner. RoBSA is a lightweight, training-free, and layer-aware algorithm that can be integrated in a plug-and-play fashion. In long-context scenarios with very large models, our method reduces end-to-end inference latency in the decoding stage by up to 2.55x with minimal accuracy loss compared to full attention.

Anthology ID: 2026.acl-long.46
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1028–1044
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.46/
Cite (ACL): Xinyu Shi, Kairong Luo, Zhen Zheng, and Wenguang Chen. 2026. RoBSA: RoPE-based Blockwise Sparse Multi-head Latent Attention. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1028–1044, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): RoBSA: RoPE-based Blockwise Sparse Multi-head Latent Attention (Shi et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.46.pdf
Checklist: 2026.acl-long.46.checklist.pdf
