2023
pdf
abs
Tailoring Instructions to Student’s Learning Levels Boosts Knowledge Distillation
Yuxin Ren
|
Zihan Zhong
|
Xingjian Shi
|
Yi Zhu
|
Chun Yuan
|
Mu Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. In order to enhance the guidance of the teacher training process, we introduce the concept of distillation influence to determine the impact of distillation from each training sample on the student’s generalization ability. In this paper, we propose Learning Good Teacher Matters (LGTM), an efficient training technique for incorporating distillation influence into the teacher’s learning process. By prioritizing samples that are likely to enhance the student’s generalization ability, our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
pdf
abs
Automated Few-Shot Classification with Instruction-Finetuned Language Models
Rami Aly
|
Xingjian Shi
|
Kaixiang Lin
|
Aston Zhang
|
Andrew Wilson
Findings of the Association for Computational Linguistics: EMNLP 2023
A particularly successful class of approaches for few-shot learning combines language models with prompts - hand-crafted task descriptions that complement data samples. However, designing prompts by hand for each task commonly requires domain knowledge and substantial guesswork. We observe, in the context of classification tasks, that instruction finetuned language models are remarkably robust towards some dimensions of a prompt’s design. We subsequently propose a simple method to eliminate the need for handcrafted prompts, named AuT-Few. This approach consists of (i) a prompt retrieval module that selects suitable task instructions from the instruction-tuning knowledge base, and (ii) the generation of two distinct, semantically meaningful, class descriptions and a selection mechanism via cross-validation. Over 12 datasets, spanning 8 classification tasks, we show that AuT-Few outperforms current state-of-the-art few-shot learning methods. Moreover, AuT-Few is the best ranking method across datasets on the RAFT few-shot benchmark. Notably, these results are achieved without task-specific handcrafted prompts on unseen tasks.
2021
pdf
abs
Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
Haoyu He
|
Xingjian Shi
|
Jonas Mueller
|
Sheng Zha
|
Mu Li
|
George Karypis
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
Knowledge Distillation (KD) offers a natural way to reduce the latency and memory/energy usage of massive pretrained models that have come to dominate Natural Language Processing (NLP) in recent years. While numerous sophisticated variants of KD algorithms have been proposed for NLP applications, the key factors underpinning the optimal distillation performance are often confounded and remain unclear. We aim to identify how different components in the KD pipeline affect the resulting performance and how much the optimal KD pipeline varies across different datasets/tasks, such as the data augmentation policy, the loss function, and the intermediate representation for transferring the knowledge between teacher and student. To tease apart their effects, we propose Distiller, a meta KD framework that systematically combines a broad range of techniques across different stages of the KD pipeline, which enables us to quantify each component’s contribution. Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-objective functions with better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyper-parameter optimization. Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation, MI-performs the best, and 3) data augmentation provides a large boost for small training datasets or small student networks. Moreover, we find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm that can recommend a good KD pipeline for a new dataset.
2019
bib
abs
Dive into Deep Learning for Natural Language Processing
Haibin Lin
|
Xingjian Shi
|
Leonard Lausen
|
Aston Zhang
|
He He
|
Sheng Zha
|
Alexander Smola
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts
Deep learning has become the dominant approach to NLP problems, especially when applied on large scale corpora. Recent progress on unsupervised pre-training techniques such as BERT, ELMo, GPT-2, and language modeling in general, when applied on large corpora, is shown to be effective in improving a wide variety of downstream tasks. These techniques push the limits of available hardware, requiring specialized frameworks optimized for GPU, ASIC, and distributed cloud-based training.A few complexities pose challenges to scale these models and algorithms effectively. Compared to other areas where deep learning is applied, these NLP models contain a variety of moving parts: text normalization and tokenization, word representation at subword-level and word-level, variable-length models such as RNN and attention, and sequential decoder based on beam search, among others.In this hands-on tutorial, we take a closer look at the challenges from these complexities and see how with proper tooling with Apache MXNet and GluonNLP, we can overcome these challenges and achieve state-of-the-art results for real-world problems. GluonNLP is a powerful new toolkit that combines MXNet’s speed, the flexibility of Gluon, and an extensive new library automating the most laborious aspects of deep learning for NLP.