Zibin Zheng
2026
From Logical to Computational Sparsity: Structure-Aware Block-Sparse Attention for Long-Code Completion
Yanli Wang | Yanlin Wang | Bowen Zhang | Yiwei Zhang | Daya Guo | Jiachi Chen | Hongyu Zhang | Zibin Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yanli Wang | Yanlin Wang | Bowen Zhang | Yiwei Zhang | Daya Guo | Jiachi Chen | Hongyu Zhang | Zibin Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code Large Language Models face critical Time-To-First-Token (TTFT) latency challenges when handling long code completion due to the quadratic complexity (O(n2)) of attention mechanisms. While existing sparse attention methods attempt to address this issue, they suffer from three key limitations: (1) general sparse patterns cause excessive accuracy degradation without considering code structure, (2) code-specific methods achieve only logical sparsity without actual computational speedup, and (3) limited adaptation to complex scenarios such as repository-level completion. We propose **SabreCoder**, a training-free **S**tructure-**a**ware **b**lock-spa**r**s**e** attention mechanism that bridges the gap between logical and computational sparsity. SabreCoder parses code into semantic chunks, constructs chunk-level sparse patterns through dependency analysis and similarity matching, and maps them to GPU-friendly block-sparse formats. Extensive experiments on LCC and CrossCodeEval benchmarks demonstrate that SabreCoder reduces TTFT by 45-55% while maintaining accuracy within 3% of dense attention.
2024
Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models
Wenqing Chen | Weicheng Wang | Zhixuan Chu | Kui Ren | Zibin Zheng | Zhichao Lu
Findings of the Association for Computational Linguistics: ACL 2024
Wenqing Chen | Weicheng Wang | Zhixuan Chu | Kui Ren | Zibin Zheng | Zhichao Lu
Findings of the Association for Computational Linguistics: ACL 2024
Recently, the self-consistency decoding strategy has shown the ability to improve performance for complex reasoning tasks with large language models (LLMs). However, the costs may be high because the sampling process of the strategy generates some low-probability text, resulting in low-quality reasoning paths. As a consequence, it requires a relatively large sampling number to obtain good aggregation performance. In this paper, we propose an alternative strategy, self-para-consistency. It first generates multiple paraphrases for each test question, then generates reasoning paths for the original and all the paraphrased questions based on greedy decoding, and finally selects the most consistent answer. Since all the candidate paths have relatively high probabilities, the sampling number could be much smaller than the self-consistency strategy. Extensive experiments on complex reasoning datasets demonstrate the effectiveness of our method in reducing the sampling number.