Bogdan Nicolae
2026
PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Krishna Teja Chitty-Venkata | Jie Ye | Siddhisanket Raskar | Anthony Kougkas | Xian Sun | Murali Emani | Venkatram Vishwanath | Bogdan Nicolae
Findings of the Association for Computational Linguistics: EACL 2026
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM’s PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction on the Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long-context tasks.
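The core idea described above, evicting the KV cache at the granularity of whole pages rather than individual tokens scattered across pages, can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the per-block importance score and the function names are assumptions, and real vLLM pages hold tensors on GPU rather than NumPy arrays.

```python
import numpy as np

BLOCK_SIZE = 16  # tokens per page, mirroring vLLM's fixed-size blocks


def evict_blocks(kv_blocks, block_scores, max_blocks):
    """Structured block-wise eviction sketch (hypothetical scoring).

    kv_blocks:    list of (K, V) array pairs, each covering BLOCK_SIZE tokens
    block_scores: one importance score per block; the metric here is a
                  placeholder for whatever statistic the method uses
    max_blocks:   memory budget expressed in pages

    Whole lowest-scoring pages are dropped until the cache fits the budget,
    so page boundaries stay intact and no per-token compaction is needed.
    """
    kv_blocks = list(kv_blocks)
    block_scores = list(block_scores)
    while len(kv_blocks) > max_blocks:
        worst = int(np.argmin(block_scores))
        kv_blocks.pop(worst)
        block_scores.pop(worst)
    return kv_blocks, block_scores
```

Because eviction removes entire pages, the paged memory layout is preserved and the attention kernel can run unchanged over the remaining pages, which is the property the abstract highlights.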
2025
CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
Ziyue Liu | Ruijie Zhang | Zhengyang Wang | Mingsong Yan | Zi Yang | Paul D. Hovland | Bogdan Nicolae | Franck Cappello | Sui Tang | Zheng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The full-size MLPs and the projection layers in attention introduce tremendous model sizes in large language models (LLMs), consuming extensive computational resources in pre-training. We empirically observe that the activations of pre-trained LLMs exhibit a low-rank property. Motivated by this observation, we propose **CoLA** and its memory-efficient implementation, **CoLA-M**, to replace these full-size layers with compute-efficient **auto-encoders** that naturally enforce low-rank activations throughout training. This fundamental architectural change eliminates activation redundancy and significantly boosts model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by 2× and improves training throughput by 1.86× while maintaining full-rank-level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also 2× smaller, enabling faster inference with lower memory cost on resource-constrained platforms.
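The replacement the abstract describes, swapping a full-size linear layer for an auto-encoder that bottlenecks the activation at rank r, can be sketched as below. This is an illustrative sketch only: the layer shapes, the ReLU nonlinearity, and the function names are assumptions, not CoLA's exact architecture, but the parameter-count arithmetic shows where the compute savings come from.

```python
import numpy as np


def full_layer(x, W):
    """Baseline full-size layer: W is d_out x d_in."""
    return np.maximum(W @ x, 0.0)


def cola_like_layer(x, A, B):
    """Auto-encoder replacement (sketch): encode to rank r, apply the
    nonlinearity on the low-rank activation, then decode back up.
    A: r x d_in (encoder), B: d_out x r (decoder), with r << min(d_in, d_out).
    """
    return B @ np.maximum(A @ x, 0.0)


# Parameter/FLOP comparison for illustrative sizes (assumed, not from the paper)
d_in, d_out, r = 1024, 1024, 128
full_params = d_in * d_out            # 1,048,576
lowrank_params = r * (d_in + d_out)   # 262,144 -> 4x fewer here
```

Since both matrix multiplies in the factored layer cost O(r·(d_in + d_out)) instead of O(d_in·d_out), shrinking r directly trades model rank for compute and parameters, which is the lever behind the reported 2× cost reduction.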