Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor
Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu
Abstract
Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more lightweight one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher’s and the student’s output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces more fine-grained substructures than the teacher factorization; 3) the teacher factorization produces more fine-grained substructures than the student factorization; 4) the factorization forms from the teacher and the student are incompatible.
- Anthology ID:
- 2021.acl-long.46
- Volume:
- Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
- Venues:
- ACL | IJCNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 550–564
- Language:
- URL:
- https://aclanthology.org/2021.acl-long.46
- DOI:
- 10.18653/v1/2021.acl-long.46
- Cite (ACL):
- Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021. Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 550–564, Online. Association for Computational Linguistics.
- Cite (Informal):
- Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor (Wang et al., ACL-IJCNLP 2021)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/2021.acl-long.46.pdf
- Code
- Alibaba-NLP/StructuralKD
- Data
- CoNLL 2002, CoNLL 2003, Penn Treebank, WikiAnn
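To make the factorized objective from the abstract concrete, here is a minimal, hypothetical sketch of one instance of structural knowledge distillation; it is not the paper’s implementation (the official code is at Alibaba-NLP/StructuralKD). It assumes a linear-chain CRF teacher distilled into a token-wise softmax student, a case where the intractable cross-entropy over all label sequences factorizes into a sum of token-level terms weighted by the teacher’s posterior marginals, which are tractable via forward-backward. All function names and toy scores below are illustrative.

```python
import numpy as np

def logsumexp(x, axis=None, keepdims=False):
    """Numerically stable log-sum-exp."""
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def crf_label_marginals(emissions, transitions):
    """Posterior label marginals q_t(y) of a linear-chain CRF teacher,
    computed with the forward-backward algorithm.

    emissions:   (T, L) per-token label scores (log-space)
    transitions: (L, L) label-transition scores (log-space)
    """
    T, L = emissions.shape
    alpha = np.zeros((T, L))  # forward scores
    beta = np.zeros((T, L))   # backward scores
    alpha[0] = emissions[0]
    for t in range(1, T):
        alpha[t] = logsumexp(alpha[t - 1][:, None] + transitions, axis=0) + emissions[t]
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(transitions + emissions[t + 1] + beta[t + 1], axis=1)
    log_Z = logsumexp(alpha[-1])
    return np.exp(alpha + beta - log_Z)  # each row sums to 1

def structural_kd_loss(teacher_marginals, student_logits):
    """Factorized KD objective: -sum_t sum_y q_t(y) log p_t(y),
    where p_t is the student's per-token softmax distribution."""
    log_p = student_logits - logsumexp(student_logits, axis=1, keepdims=True)
    return -np.sum(teacher_marginals * log_p)
```

Because the student scores each token independently, the sequence-level expectation under the teacher collapses exactly to these per-token terms; distillation then only needs the teacher’s marginals, never an explicit sum over the exponentially many label sequences.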