FastFormers: Highly Efficient Transformer Models for Natural Language Understanding

Young Jin Kim, Hany Hassan


Abstract
Transformer-based models are the state of the art for Natural Language Understanding (NLU) applications. Models keep getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements in inference efficiency. We provide effective recipes that can guide practitioners to choose the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve from 9.8x up to 233.9x speed-up compared to out-of-the-box models on CPU. On GPU, we also achieve up to 12.4x speed-up with the presented methods. We show that FastFormers can drastically reduce the cost of serving 100 million requests from 4,223 USD to just 18 USD on an Azure F16s_v2 instance. This translates to a sustainable runtime, reducing energy consumption by 6.9x to 125.8x according to the metrics used in the SustaiNLP 2020 shared task.
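As a rough illustration of the "numerical optimization" part of the recipe, the sketch below applies PyTorch's dynamic 8-bit quantization to a Transformer classifier for CPU inference. This is not the official FastFormers pipeline; the checkpoint name is a placeholder standing in for a task-specific distilled model.

    # Minimal sketch (assumption: illustrative only, not the FastFormers code):
    # dynamic int8 quantization of a Transformer encoder for CPU inference.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "distilbert-base-uncased"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Quantize all Linear layers to int8 weights; activations are
    # quantized dynamically at runtime.
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    inputs = tokenizer("FastFormers makes Transformer inference cheaper.",
                       return_tensors="pt")
    with torch.no_grad():
        logits = quantized_model(**inputs).logits
    print(logits)

The paper combines this kind of quantization with knowledge distillation into smaller student models and structured pruning of attention heads and feed-forward units; see the released code linked below for the full recipes.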
Anthology ID:
2020.sustainlp-1.20
Volume:
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
Month:
November
Year:
2020
Address:
Online
Editors:
Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, Thomas Wolf
Venue:
sustainlp
Publisher:
Association for Computational Linguistics
Pages:
149–158
URL:
https://aclanthology.org/2020.sustainlp-1.20
DOI:
10.18653/v1/2020.sustainlp-1.20
Cite (ACL):
Young Jin Kim and Hany Hassan. 2020. FastFormers: Highly Efficient Transformer Models for Natural Language Understanding. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 149–158, Online. Association for Computational Linguistics.
Cite (Informal):
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding (Kim & Hassan, sustainlp 2020)
PDF:
https://aclanthology.org/2020.sustainlp-1.20.pdf
Video:
https://slideslive.com/38939442
Code
microsoft/fastformers + additional community code
Data
SuperGLUE