Nathan Susanj
2025
Wanda++: Pruning Large Language Models via Regional Gradients
Yifan Yang | Kai Zhen | Bhavana Ganesh | Aram Galstyan | Goeric Huybrechts | Markus Müller | Jonas M. Kübler | Rupak Vignesh Swaminathan | Athanasios Mouchtaris | Sravan Babu Bodapati | Nathan Susanj | Zheng Zhang | Jack FitzGerald | Abhishek Kumar
Findings of the Association for Computational Linguistics: ACL 2025
Pruning of Large Language Models (LLMs) seeks to remove unimportant weights to speed up inference with minimal accuracy impact. However, existing methods often suffer from accuracy degradation unless full-model sparsity-aware fine-tuning is applied. This paper presents Wanda++, a novel pruning framework that outperforms state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ is the first to improve the pruning score with regional gradients, and it proposes an efficient regional optimization method that minimizes pruning-induced discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on language modeling and generalizes effectively to downstream tasks. Moreover, despite updating weights during regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning and further reduces perplexity substantially when combined with LoRA. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.
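For intuition, here is a minimal sketch of how a Wanda-style magnitude-times-activation score might be augmented with a decoder-block-level gradient term. The combination rule, the `alpha` coefficient, and the function names are illustrative assumptions, not the authors' implementation; the exact Wanda++ scoring and regional optimization steps follow the paper.

```python
import torch


def wanda_plus_like_score(weight: torch.Tensor,
                          act_norm: torch.Tensor,
                          regional_grad: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Per-weight importance score (illustrative, not the paper's exact formula).

    weight:        (out_features, in_features) layer weight
    act_norm:      (in_features,) L2 norm of the layer's input activations,
                   collected from calibration data (as in Wanda)
    regional_grad: (out_features, in_features) gradient of a decoder-block-level
                   (regional) loss with respect to this weight
    alpha:         assumed mixing coefficient introduced here for illustration
    """
    wanda_term = weight.abs() * act_norm.unsqueeze(0)   # classic Wanda score
    grad_term = regional_grad.abs() * weight.abs()      # gradient-weighted magnitude
    return wanda_term + alpha * grad_term


def prune_rowwise(weight: torch.Tensor, score: torch.Tensor,
                  sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-scoring weights within each output row (unstructured)."""
    k = int(weight.shape[1] * sparsity)
    idx = torch.topk(score, k, dim=1, largest=False).indices  # k smallest per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask
```

In this sketch the gradient term simply up-weights weights whose removal the block-level loss is sensitive to; the regional optimization that fine-tunes each decoder block after pruning is omitted.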
2021
Revisiting Pretraining with Adapters
Seungwon Kim | Alex Shum | Nathan Susanj | Jonathan Hilgart
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Pretrained language models have served as the backbone for many state-of-the-art NLP results. These models are large and expensive to train. Recent work suggests that continued pretraining on task-specific data is worth the effort, as it leads to improved performance on downstream tasks. We explore alternatives to full-scale task-specific pretraining of language models through the use of adapter modules, a parameter-efficient approach to transfer learning. We find that adapter-based pretraining achieves results comparable to task-specific pretraining while using a fraction of the overall trainable parameters. We further explore direct use of adapters without continued pretraining and find that direct fine-tuning performs mostly on par with pretrained adapter models, contradicting the benefits previously attributed to continued pretraining in full pretraining-then-fine-tuning strategies. Lastly, we perform an ablation study on task-adaptive pretraining to investigate how different hyperparameter settings affect the effectiveness of the pretraining.
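For reference, a minimal sketch of the kind of bottleneck adapter module this line of work builds on (down-projection, nonlinearity, up-projection, residual connection). The hidden and bottleneck sizes and the placement inside the transformer are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter inserted into a frozen pretrained transformer layer."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the backbone's representation intact
        # when the adapter's contribution is small.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# During adapter-based training, only adapter parameters are updated while the
# pretrained transformer weights stay frozen, e.g.:
#   for p in backbone.parameters(): p.requires_grad = False
#   for p in adapter.parameters():  p.requires_grad = True
```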