Aditya Anantharaman


2026

Large language models (LLMs) have been widely deployed and have achieved remarkable success in downstream tasks. However, their high latency continues to pose challenges for real-time applications that require fast inference, and the need to train and deploy distinct models for different hardware constraints increases both financial and computational costs. To address this, we propose Nested Matrix Learning (NML), a method that trains a single, flexible model capable of generating multiple high-performing student models of varying sizes. This is achieved by simultaneously optimizing a pre-trained teacher model and its nested sub-models in a single training process, without sacrificing the teacher’s performance. NML thus provides a scalable solution, allowing a single trained model to be adapted to different computational budgets. Our extensive experiments show that student models produced by NML, which can be up to 10x smaller than the full-size model, can be directly deployed for efficient inference or serve as superior initialization points for further fine-tuning in downstream tasks. By preserving the performance of the teacher model while delivering compact and efficient student models of various sizes, NML enhances the usability and adaptability of LLMs in real-world scenarios.
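To illustrate the core idea of jointly optimizing a model and its nested sub-models, here is a minimal sketch under the assumption that each sub-model shares the leading slice of the full model's weight matrices; the names (NestedLinear, width_frac) and the toy loss are illustrative only and are not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedLinear(nn.Module):
    """Linear layer whose leading output slice defines each smaller sub-model."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, width_frac=1.0):
        # Keep only the first width_frac fraction of output units, so every
        # sub-model is nested inside (and shares weights with) the full model.
        k = max(1, int(self.weight.size(0) * width_frac))
        return F.linear(x, self.weight[:k], self.bias[:k])

# Toy setup: one nested layer plus a separate classifier head per width.
widths = (1.0, 0.5, 0.25)
layer = NestedLinear(128, 64)
heads = nn.ModuleDict({str(int(f * 100)): nn.Linear(max(1, int(64 * f)), 10)
                       for f in widths})
opt = torch.optim.AdamW(list(layer.parameters()) + list(heads.parameters()), lr=1e-3)

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

# Single training step: sum the losses of the full model and every nested
# sub-model so all sizes are optimized simultaneously.
loss = torch.zeros(())
for f in widths:
    h = torch.relu(layer(x, width_frac=f))
    loss = loss + F.cross_entropy(heads[str(int(f * 100))](h), y)
loss.backward()
opt.step()
opt.zero_grad()
```

After training, any of the nested widths can be extracted as a standalone student model with no extra training runs, which is the deployment scenario the abstract describes.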

2023

Knowledge Distillation (KD) is one of the most effective approaches to deploying large-scale pre-trained language models in low-latency environments, transferring the knowledge contained in the large-scale models to smaller student models. Prior KD approaches use the soft labels and intermediate activations generated by the teacher to transfer knowledge into the student’s model parameters alone. In this paper, we show that giving the student access to non-parametric memory, in the form of a knowledge base holding the teacher’s soft labels and predictions, can further improve student generalization. To enable the student to retrieve effectively from this knowledge base, we propose a new framework and loss function that preserve the semantic similarities between teacher and student training examples. We show through extensive experiments that our retrieval mechanism achieves state-of-the-art performance for task-specific knowledge distillation on the GLUE benchmark.
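A minimal sketch of the retrieval idea follows, assuming a simple nearest-neighbour memory of teacher embeddings and soft labels and a plain KL-based combination of targets; the function and variable names are hypothetical, and the paper's actual framework and similarity-preserving loss are not shown here.

```python
import torch
import torch.nn.functional as F

# Knowledge base built offline from the teacher: one embedding and one
# soft-label distribution per training example.
kb_embeddings = F.normalize(torch.randn(10000, 256), dim=-1)   # teacher embeddings
kb_soft_labels = F.softmax(torch.randn(10000, 3), dim=-1)      # teacher soft labels

def retrieve_soft_labels(student_emb, k=8):
    """Average the soft labels of the k teacher examples most similar
    to each student embedding (cosine similarity)."""
    sims = F.normalize(student_emb, dim=-1) @ kb_embeddings.T   # (batch, KB size)
    topk = sims.topk(k, dim=-1).indices                         # (batch, k)
    return kb_soft_labels[topk].mean(dim=1)                     # (batch, classes)

# During student training: combine the usual KD loss on the teacher's
# soft labels with a loss against the retrieved neighbourhood labels.
student_emb = torch.randn(32, 256)                    # student representations
student_logits = torch.randn(32, 3, requires_grad=True)
teacher_probs = F.softmax(torch.randn(32, 3), dim=-1)

retrieved = retrieve_soft_labels(student_emb)
log_p = F.log_softmax(student_logits, dim=-1)
loss = F.kl_div(log_p, teacher_probs, reduction="batchmean") \
     + F.kl_div(log_p, retrieved, reduction="batchmean")
loss.backward()
```

For retrieval to be useful, student embeddings must stay comparable to the teacher-built knowledge base, which is what the similarity-preserving loss in the abstract addresses.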