2024
pdf
GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Aashiq Muhamed
|
Oscar Li
|
David Woodruff
|
Mona T. Diab
|
Virginia Smith
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
pdf
abs
Less is Fed More: Sparsity Reduces Feature Distortion in Federated Learning
Abhinav Sukumar Rao
|
Aashiq Muhamed
|
Harshita Diddee
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Our work studies Multilingual Federated Learning (FL), a decentralized paradigm that, although promising, grapples with issues such as client drift and suboptimal generalization in diverse, multilingual settings. We highlight limitations in existing approaches to generalize across both actively participating and inactive client language pairs. To mitigate these challenges, we introduce FedSparseNet, which incorporates sparse-network training, and LoRA, based on Low-Rank Adaptation. These approaches maintain the model’s fidelity to its pretraining distribution, thereby ensuring robust performance on both seen and unseen language pairs, while simultaneously enhancing communication efficiency by selectively transmitting trainable parameters. Our empirical evaluations demonstrate that FedSparseNet outperforms conventional FL models on both seen and unseen clients, while LoRA shows remarkable improvements in unseen client performance. Additionally, we propose the Continuous Relative Robustness Metric, a novel metric to uniformly assess a model’s performance across diverse language pairs. We open-source our code for reproducibility on GitHub.
2023
pdf
abs
ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models
Jianyi Zhang
|
Aashiq Muhamed
|
Aditya Anantharaman
|
Guoyin Wang
|
Changyou Chen
|
Kai Zhong
|
Qingjun Cui
|
Yi Xu
|
Belinda Zeng
|
Trishul Chilimbi
|
Yiran Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Knowledge Distillation (KD) is one of the most effective approaches to deploying large-scale pre-trained language models in low-latency environments by transferring the knowledge contained in the large-scale models to smaller student models. Prior KD approaches use the soft labels and intermediate activations generated by the teacher to transfer knowledge to the student model parameters alone. In this paper, we show that having access to non-parametric memory in the form of a knowledge base with the teacher’s soft labels and predictions can further improve student generalization. To enable the student to retrieve from the knowledge base effectively, we propose a new framework and loss function that preserves the semantic similarities of teacher and student training examples. We show through extensive experiments that our retrieval mechanism can achieve state-of-the-art performance for task-specific knowledge distillation on the GLUE benchmark.