Redko Dmitry
2026
Motivating Next-Gen Accelerators with Flexible N:M Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches
Shirin Alanova | Kristina Kazistova | Ekaterina Galaeva | Alina Kostromina | Vladimir Smirnov | Redko Dmitry | Alexey Dontsov | Maxim Zhelnin | Evgeny Burnaev | Egor Shvetsov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
The demand for efficient large language model inference has spurred interest in sparsification, yet current hardware support remains narrowly focused on 2:4 weight sparsity. In this work, we argue that activation sparsity, despite being overlooked in hardware design, offers a promising path for dynamic, input-adaptive compression with significant I/O and memory benefits. We present a comprehensive post-training study of N:M activation pruning across four LLMs (Llama2-7B-chat, Llama3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma3-4B-Instruct), demonstrating that activation pruning consistently outperforms weight pruning at matched sparsity levels. We evaluate lightweight, plug-and-play error mitigation and selection strategies that require minimal or no calibration data across four sparsity patterns: 2:4, 4:8, 8:16, and 16:32. Among these, 16:32 approaches the performance of unstructured 50% sparsity and is approximately 2.7× better than 2:4, while 8:16 offers an optimal balance of accuracy and practicality. Our results provide evidence that next-generation accelerators should consider native support for N:M activation sparsity, and they can serve as a strong baseline for future methods.
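To make the N:M patterns concrete, the following is a minimal sketch of magnitude-based N:M pruning applied to an activation tensor: within every group of M consecutive values along the last axis, the N largest-magnitude entries are kept and the rest are zeroed. The function name `nm_prune` and the plain magnitude criterion are assumptions made for illustration; the abstract does not specify the selection or error-mitigation strategies the paper evaluates, so this should not be read as the authors' method.

```python
import numpy as np


def nm_prune(x: np.ndarray, n: int, m: int) -> np.ndarray:
    """Keep the n largest-magnitude values in each group of m (last axis).

    Hypothetical helper for illustration only; uses a plain magnitude
    criterion, which is an assumption rather than the paper's method.
    """
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if x.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    if n == m:
        return x.copy()  # nothing to prune

    groups = x.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argpartition(np.abs(groups), m - n - 1, axis=1)[:, : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(x.shape)


# Usage: a 2:4 pattern keeps 2 of every 4 values (50% sparsity).
acts = np.array([[1.0, -3.0, 0.5, 2.0, -0.1, 0.2, 4.0, -5.0]])
print(nm_prune(acts, n=2, m=4))
# [[ 0. -3.  0.  2.  0.  0.  4. -5.]]
```

All four patterns named in the abstract (2:4, 4:8, 8:16, 16:32) zero out half of the values. They differ only in group size: a larger M gives the selection more freedom in where the kept values fall, which is consistent with 16:32 approaching unstructured 50% sparsity.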