Devleena Das
2026
KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference
Sai Gokhale | Devleena Das | Rajeev Patwari | Ashish Sirasao | Elliott Delaye
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference because the key-value (KV) cache grows linearly with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ, and identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We further verify the selected frontiers on the Needle-in-a-Haystack, GSM8K, and MMLU benchmarks, as well as at extended context lengths of up to 128k, to demonstrate the practical need for joint optimization in efficient LLM inference.
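To make the KV-cache side of the trade-off concrete, here is a minimal sketch of per-token symmetric int8 quantization of a cached key/value tensor. It illustrates the general technique only: the paper additionally covers int2/int4, mixed precision, and per-tensor/per-block granularities, and the function names below are hypothetical, not the framework's API.

import torch

def quantize_kv_per_token(kv: torch.Tensor, bits: int = 8):
    """Quantize a (seq_len, head_dim) KV tensor with one scale per token."""
    qmax = 2 ** (bits - 1) - 1                            # 127 for int8
    scales = kv.abs().amax(dim=-1, keepdim=True) / qmax   # per-token scale
    scales = scales.clamp(min=1e-8)                       # guard against all-zero rows
    q = torch.round(kv / scales).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scales

def dequantize_kv(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return q.float() * scales

# Example: the int8 payload is ~4x smaller than fp32, at the cost of one
# floating-point scale per token plus a small rounding error.
kv = torch.randn(1024, 128)                               # (seq_len, head_dim)
q, s = quantize_kv_per_token(kv)
reconstruction_error = (dequantize_kv(q, s) - kv).abs().mean()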
2025
Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation
Devleena Das | Rajeev Patwari | Ashish Sirasao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracy from any source that degrades model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight, dataset-agnostic method for recovering accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selected layers, aligning the degraded model to its full-precision counterpart. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs) spanning different attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracy by 5-17% on MHA and GQA SLMs.
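The core recovery objective is standard logit distillation from the full-precision model (teacher) into LoRA adapters attached to the degraded model (student). A minimal sketch using Hugging Face peft follows; the checkpoint path, target modules, rank, and temperature are illustrative assumptions, not the paper's reported configuration.

import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def distill_step(student, teacher, batch, temperature=2.0):
    """One step of logit distillation on a batch of (synthetic) inputs."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits      # full-precision reference
    student_logits = student(**batch).logits          # degraded model + LoRA
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss

# Attach trainable LoRA adapters to a subset of projection layers of the
# degraded model; the checkpoint path below is a hypothetical placeholder.
degraded_model = AutoModelForCausalLM.from_pretrained("path/to/degraded-slm")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
student = get_peft_model(degraded_model, lora_cfg)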
2024
DEFT-UCS: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection for Text-Editing
Devleena Das | Vivek Khetan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent advances have led to the availability of many pre-trained language models (PLMs); however, an open question is how much data is truly needed to fine-tune PLMs for downstream tasks. In this work, we introduce DEFT-UCS, a data-efficient fine-tuning framework that leverages unsupervised core-set selection to identify a smaller, representative dataset for fine-tuning PLMs on the text generation underlying text-editing tasks such as simplification, grammar correction, and clarity edits. We examine the efficacy of DEFT-UCS across multiple text-editing tasks and compare it to the state-of-the-art text-editing model, CoEDIT. Our results demonstrate that DEFT-UCS models are just as accurate as CoEDIT across eight different datasets covering six different editing tasks, while being fine-tuned on 70% less data.
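As an illustration of the general idea, the sketch below selects a core set by clustering sentence embeddings and keeping the examples nearest each centroid. The embedding model, clustering method, and nearest-to-centroid rule are assumptions for illustration and not DEFT-UCS's exact selection procedure.

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_coreset(texts, keep_fraction=0.3, n_clusters=100, seed=0):
    """Embed the corpus, cluster it, and keep the examples closest to
    their cluster centroids so the subset spans the data distribution."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = embedder.encode(texts, normalize_embeddings=True)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit(emb)
    # Distance of each example to its assigned centroid.
    dists = np.linalg.norm(emb - km.cluster_centers_[km.labels_], axis=1)
    n_keep = int(len(texts) * keep_fraction)
    keep_idx = np.argsort(dists)[:n_keep]             # most representative first
    return [texts[i] for i in keep_idx]

# Usage: fine-tune on ~30% of the data (cf. the paper's 70% reduction).
# coreset = select_coreset(training_texts, keep_fraction=0.3)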