Junhao Hu
2026
Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage
Junhao Hu | Fangze Li | Mingtao Xu | Feifan Meng | Shiju Zhao | Tiancheng Hu | Ting Peng | Anmin Liu | Wenrui Huang | Chenxu Liu | Ziyue Hua | Tao Xie
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term “Less is Less” (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.
2025
RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
Junhao Hu | Wenrui Huang | Weidong Wang | Zhenwen Li | Tiancheng Hu | Zhixia Liu | Xusheng Chen | Tao Xie | Yizhou Shan
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires an LLM to generate long sequences, incurring O(N) time and memory complexities per token, where N is the current sequence length. To reduce these complexities, existing sparsity-based algorithms retain the Key-Value (KV) vectors (the intermediate representations) of only the most critical tokens. However, these algorithms struggle with the “impossible trinity” of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with O(L) time but O(N) memory (L is the cache budget, L ≪ N). To address the “impossible trinity”, in this paper, we identify a new attention pattern during the decode stage of reasoning tasks, where milestone tokens (analogous to lemmas in mathematical proofs) emerge, are utilized, and then become unimportant afterward. Based on this pattern, we propose a new algorithm, RaaS, that identifies milestone tokens and retains their KV vectors until they are no longer needed, achieving high accuracy with O(L) time and O(L) memory complexities.
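To make the O(L) time and O(L) memory claim concrete, the following is a minimal sketch of a budgeted KV cache for single-head attention. It is an illustration under stated assumptions, not the RaaS algorithm as published: the class name `BudgetedKVCache`, the `threshold` parameter, and the rule "evict the token whose attention weight last reached the threshold longest ago" are all hypothetical choices made here to show how a token can be kept while it is still being used and dropped once it is not.

```python
import math


def attention_weights(query, keys):
    """Softmax over dot-product scores between one query and the cached keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class BudgetedKVCache:
    """Hypothetical cache holding at most `budget` tokens between decode steps.

    Assumption (not from the abstract): a token counts as "used" at a step
    when its attention weight is at least `threshold`.
    """

    def __init__(self, budget, threshold):
        self.budget = budget
        self.threshold = threshold
        self.entries = []  # dicts with keys: pos, key, value, last_used

    def step(self, pos, query, key, value):
        # Cache the new token, then attend over at most budget + 1 entries,
        # so the per-token work depends on the budget, not the sequence length.
        self.entries.append(
            {"pos": pos, "key": key, "value": value, "last_used": pos}
        )
        weights = attention_weights(query, [e["key"] for e in self.entries])

        # Refresh the timestamp of every token that was used at this step.
        for entry, weight in zip(self.entries, weights):
            if weight >= self.threshold:
                entry["last_used"] = pos

        # Weighted sum of cached values: the attention output for this step.
        output = [
            sum(w * e["value"][d] for e, w in zip(self.entries, weights))
            for d in range(len(value))
        ]

        # Over budget: drop the token that has gone unused the longest
        # (ties broken in favour of evicting the older position).
        if len(self.entries) > self.budget:
            victim = min(self.entries, key=lambda e: (e["last_used"], e["pos"]))
            self.entries.remove(victim)
        return output
```

Calling `step` once per generated token keeps the cache at no more than `budget` entries however long the sequence grows, which is the property the abstract contrasts with approaches that bound time but still store all N tokens.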