Multiple Teacher Distillation for Robust and Greener Models
Artur Ilichev, Nikita Sorokin, Irina Piontkovskaya, Valentin Malykh
Abstract
Language models are at the center of progress in natural language processing nowadays. These models are mostly of significant size. There have been successful attempts to reduce them, but at least some of these attempts rely on randomness. We propose a novel distillation procedure that leverages multiple teachers, which alleviates the dependency on the random seed and makes the resulting models more robust. We show that this procedure, applied to the TinyBERT and DistilBERT models, improves their worst-case results by up to 2% while keeping their best-case results almost the same. The latter remains true under a constraint on computation time, which is important for lessening the carbon footprint. In addition, we present the results of applying the proposed procedure to the computer vision model ResNet, which shows that the statement holds in this entirely different domain.
- Anthology ID: 2021.ranlp-1.68
- Volume: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
- Month: September
- Year: 2021
- Address: Held Online
- Editors: Ruslan Mitkov, Galia Angelova
- Venue: RANLP
- Publisher: INCOMA Ltd.
- Pages: 601–610
- URL: https://aclanthology.org/2021.ranlp-1.68
- Cite (ACL): Artur Ilichev, Nikita Sorokin, Irina Piontkovskaya, and Valentin Malykh. 2021. Multiple Teacher Distillation for Robust and Greener Models. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 601–610, Held Online. INCOMA Ltd.
- Cite (Informal): Multiple Teacher Distillation for Robust and Greener Models (Ilichev et al., RANLP 2021)
- PDF: https://preview.aclanthology.org/nschneid-patch-1/2021.ranlp-1.68.pdf
- Data: CIFAR-10, CoLA, GLUE, MRPC, QNLI, SST, SST-2
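
The abstract names the technique but does not specify how the multiple teachers are combined, so the following is only a minimal sketch of generic multi-teacher knowledge distillation in PyTorch, not the authors' exact procedure. It assumes the teachers' temperature-softened predictions are simply averaged before a KL-divergence loss; the `alpha` weighting and the temperature `T` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def multi_teacher_distillation_loss(student_logits, teacher_logits_list, labels,
                                    T=2.0, alpha=0.5):
    """Generic multi-teacher distillation loss (illustrative sketch only).

    student_logits:      (batch, num_classes) logits from the student.
    teacher_logits_list: list of (batch, num_classes) logits, one per teacher.
    labels:              (batch,) gold class indices.
    T:                   softmax temperature (assumed value, not from the paper).
    alpha:               soft/hard loss weight (assumed value, not from the paper).
    """
    # Average the teachers' temperature-softened class distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student's distribution and the averaged teachers'.
    distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (T * T)

    # Standard cross-entropy on the gold labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1.0 - alpha) * hard


# Toy usage: a student and two teachers on a 3-class task.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = [torch.randn(4, 3), torch.randn(4, 3)]
labels = torch.randint(0, 3, (4,))
loss = multi_teacher_distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

Averaging the teachers is only one possible aggregation; the same skeleton works if the teacher distributions are instead weighted or sampled, which is where a multi-teacher setup can reduce the dependence on any single teacher's random seed.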