2024
pdf
abs
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Arash Ardakani
|
Altan Haan
|
Shangyin Tan
|
Doru Thom Popovici
|
Alvin Cheung
|
Costin Iancu
|
Koushik Sen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks. However, these models are extremely memory intensive during their fine-tuning process, making them difficult to deploy on GPUs with limited memory resources. To address this issue, we introduce a new tool called SlimFit that reduces the memory requirements of these models by dynamically analyzing their training dynamics and freezing less-contributory layers during fine-tuning. The layers to freeze are chosen using a runtime inter-layer scheduling algorithm. This allows SlimFit to freeze up to 95% of layers and reduce the overall on-device GPU memory usage of transformer-based models such as ViT and BERT by an average of 2.2x, across different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10, CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For such NLP and CV tasks, SlimFit can reduce up to 3.1x the total on-device memory usage with an accuracy degradation of only up to 0.4%. As a result, while fine-tuning of ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128 requires 3 and 2 32GB GPUs, respectively, SlimFit enables fine-tuning them on a single 32GB GPU without any significant accuracy degradation. The code of SlimFit is available at https://github.com/arashardakani/SlimFit.
2022
pdf
abs
Benchmarking Language Models for Code Syntax Understanding
Da Shen
|
Xinyun Chen
|
Chenguang Wang
|
Koushik Sen
|
Dawn Song
Findings of the Association for Computational Linguistics: EMNLP 2022
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works show that pre-trained language models can capture the syntactic rules of natural languages without finetuning on syntax understanding tasks. However, there is limited understanding of how well pre-trained models understand the code structure so far. In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that pre-training on massive code data does not result in decent code syntax understanding. In fact, these pre-trained programming language models fail to match the performance of naive baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of understanding corresponding syntactic structures. Our findings point out key limitations of existing pre-training methods and suggest the importance of modeling syntactic structures for the programming language.
pdf
abs
PALT: Parameter-Lite Transfer of Language Models for Knowledge Graph Completion
Jianhao Shen
|
Chenguang Wang
|
Ye Yuan
|
Jiawei Han
|
Heng Ji
|
Koushik Sen
|
Ming Zhang
|
Dawn Song
Findings of the Association for Computational Linguistics: EMNLP 2022
This paper presents a parameter-lite transfer learning approach of pretrained language models (LM) for knowledge graph (KG) completion. Instead of finetuning, which modifies all LM parameters, we only tune a few new parameters while keeping the original LM parameters fixed. We establish this via reformulating KG completion as a “fill-in-the-blank” task, and introducing a parameter-lite encoder on top of the original LMs. We show that, by tuning far fewer parameters than finetuning, LMs transfer non-trivially to most tasks and reach competitiveness with prior state-of-the-art approaches. For instance, we outperform the fully finetuning approaches on a KG completion benchmark by tuning only 1% of the parameters.