Zhiyuan He
2026
Accelerating Prefilling via Decoding-time Contribution Sparsity
Zhiyuan He | Yike Zhang | Chengruidong Zhang | Huiqiang Jiang | Yuqing Yang | Lili Qiu
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding. Building on this observation, we propose TriangleMix, which replaces dense attention with Triangle attention in a subset of layers. Extensive experiments demonstrate that TriangleMix achieves near-lossless performance on both long-context and long-context reasoning benchmarks, while significantly improving efficiency. For 128K inputs, Triangle attention in the subset of layers achieves a 15.3× speedup in attention kernel computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9× to 3.4×). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6%–19% reduction in TTFT over using dynamic sparsity alone.
2025
LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Yike Zhang | Zhiyuan He | Huiqiang Jiang | Chengruidong Zhang | Yuqing Yang | Jianyong Wang | Lili Qiu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%–18% V cache memory reduction, and 1.45× decoding speedup. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is anonymously available at https://anonymous.4open.science/r/LeanK-7A87/README.md.
2024
Position Engineering: Boosting Large Language Models through Positional Information Manipulation
Zhiyuan He | Huiqiang Jiang | Zilong Wang | Yuqing Yang | Luna K. Qiu | Lili Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The performance of large language models (LLMs) is significantly influenced by the quality of the prompts provided. In response, researchers have developed numerous prompt engineering strategies aimed at modifying the prompt text to enhance task performance. In this paper, we introduce a novel technique termed position engineering, which offers a more efficient way to guide large language models. Unlike prompt engineering, which requires substantial effort to modify the text provided to LLMs, position engineering merely involves altering the positional information in the prompt without modifying the text itself. We have evaluated position engineering in two widely used LLM scenarios: retrieval-augmented generation (RAG) and in-context learning (ICL). Our findings show that position engineering substantially improves upon the baseline in both cases. Position engineering thus represents a promising new strategy for exploiting the capabilities of large language models.