S2O: Early Stopping for Sparse Attention via Online Permutation

Yu Zhang; Songwei Liu; Chenqian Yan; Linsheng; Beichen Ning; Fangmin Chen; Xing Wang

Abstract

Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget. As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a 128K context, S2O reduces single-operator MSE by 3.82× at matched sparsity, and reduces prefill compute density by 3.31× at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves 7.51× attention and 3.81× end-to-end speedups.
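The loading and stopping idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' kernel. It is a minimal, non-causal, single-head illustration under stated assumptions: the importance proxy (mean-pooled query against every key), the block score (the proxy softmax mass of a block), and the names `permuted_early_stop_attention`, `block_size`, and `threshold` are all hypothetical choices made here. The abstract does not specify how S2O computes its scores. The sketch gathers non-contiguous tokens through an index array instead of physically permuting K and V, accumulates blocks with a FlashAttention-style online softmax, and stops once the block score falls below the threshold.

```python
import numpy as np

def permuted_early_stop_attention(Q, K, V, block_size=16, threshold=0.0):
    """Illustrative sketch only; the scoring rules below are assumptions.

    Q: (bq, d) query block, K: (n, d) keys, V: (n, dv) values.
    Returns (output of shape (bq, dv), number of key blocks computed).
    """
    n, d = K.shape
    scale = 1.0 / np.sqrt(d)

    # Assumed lightweight preprocessing: one pooled query scored against all keys.
    proxy = (K @ Q.mean(axis=0)) * scale
    # Index remapping: logical position i reads physical token perm[i].
    perm = np.argsort(-proxy)
    # Assumed per-token importance: softmax mass under the proxy scores.
    weights = np.exp(proxy - proxy.max())
    weights /= weights.sum()

    m = np.full(Q.shape[0], -np.inf)          # running row-wise maximum
    l = np.zeros(Q.shape[0])                  # running softmax denominator
    acc = np.zeros((Q.shape[0], V.shape[1]))  # running unnormalized output
    used = 0

    # Blocks are visited from high to low estimated importance.
    for start in range(0, n, block_size):
        idx = perm[start:start + block_size]  # non-contiguous gather indices
        # Early stop: skip all remaining blocks (at least one is always computed).
        if used > 0 and weights[idx].sum() < threshold:
            break
        S = (Q @ K[idx].T) * scale
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)              # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])
        l = l * corr + P.sum(axis=1)
        acc = acc * corr[:, None] + P @ V[idx]
        m = m_new
        used += 1

    return acc / l[:, None], used
```

With `threshold=0.0` no block is skipped and the result matches dense softmax attention, because softmax attention does not depend on the order in which key/value pairs are visited. Raising the threshold trades accuracy for fewer computed blocks, which is the error-budget trade-off the abstract refers to.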

Anthology ID: 2026.acl-long.351
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7737–7751
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.351/
Cite (ACL): Yu Zhang, Songwei Liu, Chenqian Yan, Linsheng, Beichen Ning, Fangmin Chen, and Xing Wang. 2026. S2O: Early Stopping for Sparse Attention via Online Permutation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7737–7751, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): S2O: Early Stopping for Sparse Attention via Online Permutation (Zhang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.351.pdf
Checklist: 2026.acl-long.351.checklist.pdf
