Pascal Poupart


2023

pdf
Do we need Label Regularization to Fine-tune Pre-trained Language Models?
Ivan Kobyzev | Aref Jafari | Mehdi Rezagholizadeh | Tianda Li | Alan Do-Omri | Peng Lu | Pascal Poupart | Ali Ghodsi
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the fine-tuning of smaller PLM networks on downstream tasks. In this regard, we did a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.

pdf
Attribute Controlled Dialogue Prompting
Runcheng Liu | Ahmad Rashid | Ivan Kobyzev | Mehdi Rezagholizadeh | Pascal Poupart
Findings of the Association for Computational Linguistics: ACL 2023

Prompt-tuning has become an increasingly popular parameter-efficient method for adapting large pretrained language models to downstream tasks. However, both discrete prompting and continuous prompting assume fixed prompts for all data samples within a task, neglecting the fact that inputs vary greatly in some tasks such as open-domain dialogue generation. In this paper, we present a novel, instance-specific prompt-tuning algorithm for dialogue generation. Specifically, we generate prompts based on instance-level control code, rather than the conversation history, to explore their impact on controlled dialogue generation. Experiments on popular open-domain dialogue datasets, evaluated on both automated metrics and human evaluation, demonstrate that our method is superior to prompting baselines and comparable to fine-tuning with only 5%-6% of total parameters.

pdf
Contrastive Deterministic Autoencoders For Language Modeling
Amur Ghose | Pascal Poupart
Findings of the Association for Computational Linguistics: EMNLP 2023

Variational autoencoders (VAEs) are a popular family of generative models with wide applicability. Training VAEs, especially for text, often runs into the issue of posterior collapse, resulting in loss of representation quality. Deterministic autoencoders avoid this issue, and have been explored particularly well for images. It is however unclear how to best modify a deterministic model designed for images into a successful one for text. We show that with suitable adaptations, we can significantly improve on batch-normed VAEs (BN-VAEs), a strong benchmark for language modeling with VAEs, by replacing them with analogous deterministic models. We employ techniques from contrastive learning to control the entropy of the aggregate posterior of these models to make it Gaussian. The resulting models skip reparametrization steps in VAE modeling and avoid posterior collapse, while outperforming a broad range of VAE models on text generation and downstream tasks from representations. These improvements are shown to be consistent across both LSTM and Transformer-based VAE architectures. Appropriate comparisons to BERT/GPT-2 based results are also included. We also qualitatively examine the latent space through interpolation to supplement the quantitative aspects of the model.

2022

pdf
WatClaimCheck: A new Dataset for Claim Entailment and Inference
Kashif Khan | Ruizhe Wang | Pascal Poupart
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We contribute a new dataset for the task of automated fact checking and an evaluation of state of the art algorithms. The dataset includes claims (from speeches, interviews, social media and news articles), review articles published by professional fact checkers and premise articles used by those professional fact checkers to support their review and verify the veracity of the claims. An important challenge in the use of premise articles is the identification of relevant passages that will help to infer the veracity of a claim. We show that transferring a dense passage retrieval model trained with review articles improves the retrieval quality of passages in premise articles. We report results for the prediction of claim veracity by inference from premise articles.

pdf
RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
Md Akmal Haidar | Nithin Anchuri | Mehdi Rezagholizadeh | Abbas Ghaddar | Philippe Langlais | Pascal Poupart
Findings of the Association for Computational Linguistics: NAACL 2022

Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models) especially over large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and engineering efforts required for setting up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which, intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection enforces that all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. Also, we show that it acts as a regularizer for improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training-time.

pdf
Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization
Aref Jafari | Ivan Kobyzev | Mehdi Rezagholizadeh | Pascal Poupart | Ali Ghodsi
Findings of the Association for Computational Linguistics: EMNLP 2022

Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model’s (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher’s output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with the smoothed version of this objective and making it more complex as the training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).

pdf
CILDA: Contrastive Data Augmentation Using Intermediate Layer Knowledge Distillation
Md Akmal Haidar | Mehdi Rezagholizadeh | Abbas Ghaddar | Khalil Bibi | Phillippe Langlais | Pascal Poupart
Proceedings of the 29th International Conference on Computational Linguistics

Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training. In this work, we propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA. To the best of our knowledge, this is the first time that intermediate layer representations of the main task are used in improving the quality of augmented samples. More precisely, we introduce an augmentation technique for KD based on intermediate layer matching using contrastive loss to improve masked adversarial data augmentation. CILDA outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as well as in an out-of-domain evaluation.

2018

pdf
Variational Attention for Sequence-to-Sequence Models
Hareesh Bahuleyan | Lili Mou | Olga Vechtomova | Pascal Poupart
Proceedings of the 27th International Conference on Computational Linguistics

The variational encoder-decoder (VED) encodes source information as a set of random variables using a neural network, which in turn is decoded into target data using another neural network. In natural language processing, sequence-to-sequence (Seq2Seq) models typically serve as encoder-decoder networks. When combined with a traditional (deterministic) attention mechanism, the variational latent space may be bypassed by the attention model, and thus becomes ineffective. In this paper, we propose a variational attention mechanism for VED, where the attention vector is also modeled as Gaussian distributed random variables. Results on two experiments show that, without loss of quality, our proposed method alleviates the bypassing phenomenon as it increases the diversity of generated sentences.

2017

pdf
Deep Active Learning for Dialogue Generation
Nabiha Asghar | Pascal Poupart | Xin Jiang | Hang Li
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

We propose an online, end-to-end, neural generative conversational model for open-domain dialogue. It is trained using a unique combination of offline two-phase supervised learning and online human-in-the-loop active learning. While most existing research proposes offline supervision or hand-crafted reward functions for online reinforcement, we devise a novel interactive learning mechanism based on hamming-diverse beam search for response generation and one-character user-feedback at each step. Experiments show that our model inherently promotes the generation of semantically relevant and interesting responses, and can be used to train agents with customized personas, moods and conversational styles.

2016

pdf
Overfitting at SemEval-2016 Task 3: Detecting Semantically Similar Questions in Community Question Answering Forums with Word Embeddings
Hujie Wang | Pascal Poupart
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2007

pdf
Generating Lexical Analogies Using Dependency Relations
Andy Chiu | Pascal Poupart | Chrysanne DiMarco
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2005

pdf
Partially Observable Markov Decision Processes with Continuous Observations for Dialogue Management
Jason D. Williams | Pascal Poupart | Steve Young
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue