Zayd Muhammad Kawakibi Zuhri
2026
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri | Erland Hilman Fuadi | Alham Fikri Aji
Findings of the Association for Computational Linguistics: ACL 2026
We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick consistently achieves a 0% sink rate. Softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform those using softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention.
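To make "rectified, not sum-to-one" concrete, here is a minimal pure-Python sketch that compares softmax with one rectified normalizer. The abstract does not state the formula; the form used here, ReLU(e^x - 1) / (sum |e^x - 1| + eps), is our reading of the softpick paper and should be treated as an assumption. The function names, the `eps` value, and the example scores are illustrative, and the sketch omits the max-subtraction trick a numerically stable implementation would need.

```python
import math

def softmax(scores):
    # Standard softmax: every entry is positive and the outputs sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softpick(scores, eps=1e-8):
    # Assumed form: ReLU(e^x - 1) / (sum |e^x - 1| + eps).
    # Scores <= 0 get exactly zero weight, and the outputs need not sum to 1.
    shifted = [math.exp(s) - 1.0 for s in scores]
    denom = sum(abs(v) for v in shifted) + eps
    return [max(v, 0.0) / denom for v in shifted]

scores = [2.0, 0.0, -1.0, -3.0]  # hypothetical attention logits for one query
print("softmax :", softmax(scores))
print("softpick:", softpick(scores))
```

Under this assumed form, a query whose scores are all non-positive assigns zero weight everywhere, so no token has to absorb leftover attention mass.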
2025
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri | Muhammad Farid Adilazuarda | Ayu Purwarianti | Alham Fikri Aji
Findings of the Association for Computational Linguistics: NAACL 2025
Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size by up to a factor of 6 compared to MQA. These results highlight MLKV’s potential for efficient deployment of transformer models at scale.
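The memory argument can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes the usual KV cache accounting (one key and one value tensor per cached KV head) and the Pythia-160M shape of 12 layers, 12 heads, and head dimension 64. The MLKV configuration shown, 2 layers each holding 1 KV head, is a hypothetical setting chosen only to show how a 6x reduction relative to MQA could arise; the abstract does not list the actual configurations, and the helper name and parameters are illustrative.

```python
def kv_cache_bytes(kv_layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor 2 accounts for storing both keys and values.
    # kv_layers counts only the layers that own KV heads (others reuse them).
    return 2 * kv_layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Pythia-160M-like shape; seq_len and batch_size are arbitrary example values.
shape = dict(head_dim=64, seq_len=2048, batch_size=1)

mha = kv_cache_bytes(kv_layers=12, kv_heads=12, **shape)  # one KV head per query head
mqa = kv_cache_bytes(kv_layers=12, kv_heads=1, **shape)   # one KV head per layer
mlkv = kv_cache_bytes(kv_layers=2, kv_heads=1, **shape)   # hypothetical: KV heads in 2 layers only

print(f"MHA : {mha / 2**20:.1f} MiB")
print(f"MQA : {mqa / 2**20:.1f} MiB")
print(f"MLKV: {mlkv / 2**20:.1f} MiB ({mqa // mlkv}x smaller than MQA)")
```

MQA can shrink the cache only down to one KV head per layer, so going further requires sharing KV heads across layers, which is the dimension this arithmetic varies.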