Maximizing the Effectiveness of Larger BERT Models for Compression
Wen-Shu Fan, Su Lu, Shangyu Xing, Xin-Chun Li, De-Chuan Zhan
Abstract
Knowledge distillation (KD) is a widely used approach for BERT compression, where a larger BERT model serves as a teacher that transfers knowledge to a smaller student model. Prior work has found that distilling from a larger BERT with superior performance may degrade the student's performance compared to distilling from a smaller BERT. In this paper, we investigate the limitations of existing KD methods for larger BERT models. Through Canonical Correlation Analysis, we identify that these methods fail to fully exploit the potential advantages of larger teachers. To address this, we propose an improved distillation approach that effectively enhances knowledge transfer. Comprehensive experiments demonstrate the effectiveness of our method in enabling larger BERT models to distill knowledge more efficiently.
- Anthology ID:
- 2025.acl-long.1067
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 21975–21990
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1067/
- Cite (ACL):
- Wen-Shu Fan, Su Lu, Shangyu Xing, Xin-Chun Li, and De-Chuan Zhan. 2025. Maximizing the Effectiveness of Larger BERT Models for Compression. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21975–21990, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Maximizing the Effectiveness of Larger BERT Models for Compression (Fan et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1067.pdf
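
The abstract above uses Canonical Correlation Analysis (CCA) as a lens on how much of a larger teacher's representation a compressed student actually captures. The sketch below is a minimal, hypothetical illustration of that kind of CCA measurement between teacher and student sentence representations; the array names, dimensions, and the use of scikit-learn's linear CCA are assumptions for illustration, not the authors' exact analysis procedure.

```python
# Illustrative sketch (assumed setup, not the paper's method): measure how much
# of a larger teacher's representation space a distilled student captures, via
# linear CCA between [CLS]-style sentence features extracted on the same inputs.
import numpy as np
from sklearn.cross_decomposition import CCA

def mean_canonical_correlation(teacher_feats, student_feats, n_components=20):
    """Average correlation over the top canonical directions shared by the two models.

    teacher_feats: (N, d_teacher) array of teacher sentence representations.
    student_feats: (N, d_student) array of student sentence representations.
    """
    cca = CCA(n_components=n_components, max_iter=1000)
    # Project both feature sets onto their jointly learned canonical directions.
    t_proj, s_proj = cca.fit_transform(teacher_feats, student_feats)
    # Correlation of each matched pair of canonical components, then the mean.
    corrs = [np.corrcoef(t_proj[:, i], s_proj[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

# Random placeholders standing in for real BERT features (names/dims are hypothetical):
rng = np.random.default_rng(0)
teacher_feats = rng.standard_normal((512, 1024))  # e.g. a BERT-large hidden size
student_feats = rng.standard_normal((512, 312))   # e.g. a small student's hidden size
print(mean_canonical_correlation(teacher_feats, student_feats))
```

Under this kind of analysis, a low mean canonical correlation between a large teacher and its distilled student would be consistent with the paper's finding that existing KD methods under-exploit larger teachers.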