SkipBERT: Efficient Inference with Shallow Layer Skipping

Jue Wang, Ke Chen, Gang Chen, Lidan Shou, Julian McAuley


Abstract
In this paper, we propose SkipBERT to accelerate BERT inference by skipping the computation of shallow layers. To achieve this, our approach encodes small text chunks into independent representations, which are then materialized to approximate the shallow representation of BERT. Since using such an approximation is inexpensive compared with transformer calculations, we leverage it to replace the shallow layers of BERT and skip their runtime overhead. With off-the-shelf early exit mechanisms, we also skip redundant computation from the highest few layers to further improve inference efficiency. Results on GLUE show that our approach can reduce latency by 65% without sacrificing performance. Using only two-layer transformer calculations, we can still retain 95% of BERT's accuracy.
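The idea of materializing chunk representations so that shallow layers need not run at inference time can be illustrated with a small lookup-table sketch. This is not the authors' implementation: the three-token window, the `PAD` id, the `toy_shallow_encoder` stand-in, and the `ChunkTable` class are all hypothetical names and choices made only for illustration. In a real system the encoder would be the shallow transformer layers applied to each chunk.

```python
from typing import Callable, Dict, List, Tuple

Chunk = Tuple[int, ...]
Vector = List[float]

PAD = 0  # hypothetical padding token id (assumption)


def toy_shallow_encoder(chunk: Chunk) -> Vector:
    """Stand-in for the shallow layers applied to one chunk (assumption)."""
    left, center, right = chunk
    return [float(center), 0.5 * (left + right)]


def chunks_of(token_ids: List[int]) -> List[Chunk]:
    """One three-token window centered on each position (window size is an assumption)."""
    padded = [PAD] + token_ids + [PAD]
    return [tuple(padded[i:i + 3]) for i in range(len(token_ids))]


class ChunkTable:
    """Caches chunk representations so the encoder is skipped on a hit."""

    def __init__(self, encoder: Callable[[Chunk], Vector]) -> None:
        self.encoder = encoder
        self.table: Dict[Chunk, Vector] = {}
        self.misses = 0

    def precompute(self, corpus: List[List[int]]) -> None:
        # Offline step: encode every chunk seen in the corpus once.
        for token_ids in corpus:
            for chunk in chunks_of(token_ids):
                if chunk not in self.table:
                    self.table[chunk] = self.encoder(chunk)

    def lookup(self, token_ids: List[int]) -> List[Vector]:
        # Online step: fetch stored vectors; fall back to the encoder on a miss.
        out = []
        for chunk in chunks_of(token_ids):
            if chunk not in self.table:
                self.misses += 1
                self.table[chunk] = self.encoder(chunk)
            out.append(self.table[chunk])
        return out


table = ChunkTable(toy_shallow_encoder)
table.precompute([[5, 6, 7]])
print(table.lookup([5, 6, 7]))  # [[5.0, 3.0], [6.0, 6.0], [7.0, 3.0]]
print(table.misses)             # 0
```

In this sketch, the per-token vectors returned by `lookup` would be handed to the remaining deeper layers, so the cost of the shallow layers is paid once offline rather than on every request.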
Anthology ID:
2022.acl-long.503
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
7287–7301
URL:
https://aclanthology.org/2022.acl-long.503
DOI:
10.18653/v1/2022.acl-long.503
Cite (ACL):
Jue Wang, Ke Chen, Gang Chen, Lidan Shou, and Julian McAuley. 2022. SkipBERT: Efficient Inference with Shallow Layer Skipping. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7287–7301, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
SkipBERT: Efficient Inference with Shallow Layer Skipping (Wang et al., ACL 2022)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2022.acl-long.503.pdf
Software:
2022.acl-long.503.software.zip
Code:
lorrinwww/skipbert
Data:
CoLA, MRPC, MultiNLI, SQuAD, SST, SST-2