AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity

Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang


Abstract
Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase, primarily due to the quadratic complexity of self-attention. Existing methods typically employ dynamic pattern matching and block-sparse low-level implementations. However, their reliance on local information for pattern identification fails to capture global context, and the coarse granularity of blocks leads to persistent internal sparsity, resulting in suboptimal accuracy and efficiency. To address these limitations, we propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions at a finer stripe granularity while adapting to global contextual information, achieving superior speed and accuracy. AnchorAttention comprises three key components: (1) Pattern-based Anchor Computation, leveraging the commonalities present across all inputs to rapidly compute a set of near-maximum scores as anchors; (2) Difference-aware Stripe Sparsity Identification, performing difference-aware comparisons against the anchors to quickly obtain discrete coordinates of significant regions in a stripe-like sparsity pattern; (3) Fine-grained Sparse Computation, replacing the traditional contiguous loading strategy with a discrete key-value loading approach to maximize sparsity rates while preserving hardware computational potential. Additionally, we integrate the identification strategy into a single operator to maximize parallelization potential. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time. Compared with previous state-of-the-art methods, it achieves a 1.44× speedup at a text length of 128K while maintaining higher recall.
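
The three components above can be illustrated with a minimal, non-causal PyTorch sketch for a single attention head. The function name, the anchor window, and the threshold `theta` below are illustrative assumptions for exposition only; they are not the paper's implementation or its fused kernel.

```python
import torch

def anchor_sparse_attention(q, k, v, theta=0.1, anchor_width=128):
    """Hypothetical sketch of difference-aware, stripe-granular sparse attention.

    q: (Lq, d) queries; k, v: (Lk, d) keys/values.
    All parameter names and default values are illustrative assumptions.
    """
    scale = q.shape[-1] ** -0.5

    # (1) Pattern-based anchor computation: approximate each query's
    #     near-maximum score from a small region that is commonly important
    #     (here, simply the most recent keys).
    anchor_scores = (q @ k[-anchor_width:].T) * scale           # (Lq, anchor_width)
    anchor = anchor_scores.max(dim=-1, keepdim=True).values     # (Lq, 1)

    # (2) Difference-aware stripe identification: keep only (query, key)
    #     pairs whose score is within log(theta) of the anchor, i.e. whose
    #     weight relative to the anchor exceeds theta. This yields discrete,
    #     stripe-like coordinates rather than contiguous blocks.
    scores = (q @ k.T) * scale                                   # (Lq, Lk)
    keep = (scores - anchor) >= torch.log(torch.tensor(theta))   # boolean mask

    # (3) Fine-grained sparse computation: attend only over kept positions.
    #     A real kernel would gather the selected keys/values discretely;
    #     dense masking here only illustrates the numerics.
    masked = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(masked, dim=-1) @ v

# Example usage (random tensors, head dimension 64):
# out = anchor_sparse_attention(torch.randn(1024, 64),
#                               torch.randn(1024, 64),
#                               torch.randn(1024, 64))
```

Because the anchor is taken from a subset of the true scores, every query retains at least one key (its anchor position), so the softmax stays well defined even at aggressive thresholds.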
Anthology ID:
2025.emnlp-main.430
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8548–8560
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.430/
Cite (ACL):
Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, and Yiming Zhang. 2025. AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8548–8560, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.430.pdf
Checklist:
 2025.emnlp-main.430.checklist.pdf