Chao Yang
2026
RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
Siran Liu | Guoxia Wang | Sa Wang | Jinle Zeng | Haoyang Xie | Siyu Lou | Jiabin Yang | Dianhai Yu | Haifeng Wang | Chao Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head **r**ound-**r**obin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-τ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4× speedup at 128K context length and outperforming existing dynamic sparse attention methods. The code is available at [https://github.com/PaddlePaddle/PaddleFleet](https://github.com/PaddlePaddle/PaddleFleet) (see `Research/RRAttention`).
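The two ideas named in the abstract, per-head round-robin sampling and Top-τ selection, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the exact rotation rule or selection procedure, so the sketch assumes that head `h` samples the position at offset `h mod S` inside each stride of length `S`, and that Top-τ keeps the smallest set of blocks whose normalized score mass reaches τ. The function names `sampled_positions` and `top_tau_blocks` are hypothetical.

```python
import numpy as np


def sampled_positions(seq_len, stride, head):
    """Positions sampled by one head: one per stride.

    Assumption: the in-stride offset rotates with the head index
    (offset = head mod stride), so different heads probe different
    positions and together cover the whole sequence.
    """
    offset = head % stride
    return np.arange(offset, seq_len, stride)


def top_tau_blocks(block_scores, tau):
    """Indices of the smallest set of blocks whose normalized
    score mass reaches tau (an assumed reading of Top-tau selection).
    """
    probs = block_scores / block_scores.sum()
    order = np.argsort(-probs)                 # highest score first
    cumulative = np.cumsum(probs[order])
    # First position where the running mass reaches tau, clamped
    # in case rounding leaves the total slightly below tau.
    k = min(int(np.searchsorted(cumulative, tau)) + 1, len(order))
    return np.sort(order[:k])


# With stride S, each head scores about L/S queries against L/S keys,
# which is where a cost of roughly L^2 / S^2 would come from.
covered = np.unique(np.concatenate(
    [sampled_positions(8, 4, h) for h in range(4)]
))
selected = top_tau_blocks(np.array([0.1, 0.6, 0.3]), tau=0.8)
```

In this sketch, four heads with stride 4 jointly touch every position of an 8-token sequence even though each head looks at only two of them, and the Top-τ step keeps as many blocks as needed to reach the requested mass rather than a fixed count.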