Abstract
In this paper, we propose SkipBERT to accelerate BERT inference by skipping the computation of shallow layers. To achieve this, our approach encodes small text chunks into independent representations, which are then materialized to approximate the shallow representation of BERT. Since such approximations are inexpensive compared with transformer computation, we use them to replace the shallow layers of BERT and skip their runtime overhead. With off-the-shelf early exit mechanisms, we also skip redundant computation in the top few layers to further improve inference efficiency. Results on GLUE show that our approach can reduce latency by 65% without sacrificing performance. Using only two layers of transformer computation, it still maintains 95% of BERT's accuracy.
- Anthology ID:
- 2022.acl-long.503
- Volume:
- Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Editors:
- Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7287–7301
- URL:
- https://aclanthology.org/2022.acl-long.503
- DOI:
- 10.18653/v1/2022.acl-long.503
- Cite (ACL):
- Jue Wang, Ke Chen, Gang Chen, Lidan Shou, and Julian McAuley. 2022. SkipBERT: Efficient Inference with Shallow Layer Skipping. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7287–7301, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- SkipBERT: Efficient Inference with Shallow Layer Skipping (Wang et al., ACL 2022)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2022.acl-long.503.pdf
- Code:
- lorrinwww/skipbert
- Data:
- CoLA, MRPC, MultiNLI, SQuAD, SST, SST-2
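The abstract combines two ideas: shallow-layer representations approximated from independently encoded text chunks, and early exit from the top layers. The Python sketch below illustrates both under stated assumptions. It is not the authors' implementation (their released code is lorrinwww/skipbert). Several details are assumptions made for illustration. Chunks are taken to be overlapping tri-grams padded at both ends. A dictionary stands in for a precomputed chunk lookup table. The vectors a token receives from the chunks that contain it are combined by plain averaging. Early exit uses a simple max-probability confidence threshold. All names (`encode_chunk`, `ChunkCache`, `approximate_shallow`, `run_with_early_exit`) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Chunk = Tuple[str, ...]
PAD = "[PAD]"


def chunk_sequence(tokens: Sequence[str], n: int = 3) -> List[Chunk]:
    """Split tokens into overlapping n-grams, padded at both ends,
    so every real token appears in exactly n chunks (assumed chunking)."""
    padded = [PAD] * (n - 1) + list(tokens) + [PAD] * (n - 1)
    return [tuple(padded[s:s + n]) for s in range(len(padded) - n + 1)]


class ChunkCache:
    """Memoizes per-chunk, per-token vectors. `encode_chunk` is a hypothetical
    stand-in for running the shallow layers on one chunk in isolation."""

    def __init__(self, encode_chunk: Callable[[Chunk], List[Vector]]):
        self.encode_chunk = encode_chunk
        self.table: Dict[Chunk, List[Vector]] = {}
        self.misses = 0  # number of chunks that actually had to be encoded

    def lookup(self, chunk: Chunk) -> List[Vector]:
        if chunk not in self.table:
            self.misses += 1
            self.table[chunk] = self.encode_chunk(chunk)
        return self.table[chunk]


def approximate_shallow(tokens: Sequence[str], cache: ChunkCache,
                        n: int = 3) -> List[Vector]:
    """Approximate shallow hidden states: for each token, average the vectors
    it receives from the n chunks containing it (assumed aggregation)."""
    chunks = chunk_sequence(tokens, n)
    out: List[Vector] = []
    for i in range(len(tokens)):
        p = i + n - 1  # position of token i in the padded sequence
        # Chunk s covers padded positions s..s+n-1; token i sits at offset p - s.
        vecs = [cache.lookup(chunks[s])[p - s] for s in range(p - n + 1, p + 1)]
        dim = len(vecs[0])
        out.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
    return out


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_with_early_exit(
    hidden: List[Vector],
    layers: Sequence[Callable[[List[Vector]], List[Vector]]],
    heads: Sequence[Callable[[List[Vector]], Vector]],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Run the remaining layers, stopping at the first classifier head whose
    top probability reaches `threshold`. Returns (label, layers used)."""
    assert layers and len(layers) == len(heads)
    probs: Vector = []
    depth = 0
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        hidden = layer(hidden)
        probs = softmax(head(hidden))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), depth
```

In this sketch, a repeated chunk is encoded only once and then served from the table. At inference time the shallow layers therefore reduce to lookups plus averaging, which is the source of the saving that the abstract describes. The confidence threshold trades accuracy for speed: a lower value exits earlier.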