Minh-Thang Luong

Also published as: Thang Luong

2023

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art translation model in English-Vietnamese to translate and produce both pretrained and supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubMedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI - a new NLP task in Vietnamese translated from MedNLI using the recently public En-vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.

2021

Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6x.

pdf abs
STraTA: Self-Training with Task Augmentation for Better Few-shot Learning
Tu Vu | Minh-Thang Luong | Quoc Le | Grady Simon | Mohit Iyyer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.

2020

pdf abs
Pre-Training Transformers as Energy-Based Cloze Models
Kevin Clark | Minh-Thang Luong | Quoc Le | Christopher D. Manning
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. We train Electric using an algorithm based on noise-contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre-training method. Electric performs well when transferred to downstream tasks and is particularly effective at producing likelihood scores for text: it re-ranks speech recognition n-best lists better than language models and much faster than masked language models. Furthermore, it offers a clearer and more principled view of what ELECTRA learns during pre-training.

2019

This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document generation and translation (DGT) where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language.

pdf abs
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
Kevin Clark | Minh-Thang Luong | Urvashi Khandelwal | Christopher D. Manning | Quoc V. Le
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.

2018

pdf abs
Semi-Supervised Sequence Modeling with Cross-View Training
Kevin Clark | Minh-Thang Luong | Christopher D. Manning | Quoc Le
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.

pdf bib
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
Alexandra Birch | Andrew Finch | Thang Luong | Graham Neubig | Yusuke Oda
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

pdf bib abs
Findings of the Second Workshop on Neural Machine Translation and Generation
Alexandra Birch | Andrew Finch | Minh-Thang Luong | Graham Neubig | Yusuke Oda
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This document describes the findings of the Second Workshop on Neural Machine Translation and Generation, held in concert with the annual conference of the Association for Computational Linguistics (ACL 2018). First, we summarize the research trends of papers presented in the proceedings, and note that there is particular interest in linguistic structure, domain adaptation, data augmentation, handling inadequate resources, and analysis of models. Second, we describe the results of the workshop’s shared task on efficient neural machine translation, where participants were tasked with creating MT systems that are both accurate and efficient.

2017

pdf abs
Efficient Attention using a Fixed-Size Memory Representation
Denny Britz | Melody Guan | Minh-Thang Luong
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The standard content-based attention mechanism typically used in sequence-to-sequence models is computationally expensive as it requires the comparison of large encoder and decoder states at each time step. In this work, we propose an alternative attention mechanism based on a fixed size memory representation that is more efficient. Our technique predicts a compact set of K attention contexts during encoding and lets the decoder compute an efficient lookup that does not need to consult the memory. We show that our approach performs on-par with the standard attention mechanism while yielding inference speedups of 20% for real-world translation tasks and more for tasks with longer sequences. By visualizing attention scores we demonstrate that our models learn distinct, meaningful alignments.

pdf abs
Massive Exploration of Neural Machine Translation Architectures
Denny Britz | Anna Goldie | Minh-Thang Luong | Quoc Le
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Neural Machine Translation (NMT) has shown remarkable progress over the past few years, with production systems now being deployed to end-users. As the field is moving rapidly, it has become unclear which elements of NMT architectures have a significant impact on translation quality. In this work, we present a large-scale analysis of the sensitivity of NMT architectures to common hyperparameters. We report empirical results and variance numbers for several hundred experimental runs, corresponding to over 250,000 GPU hours on a WMT English to German translation task. Our experiments provide practical insights into the relative importance of factors such as embedding size, network depth, RNN cell type, residual connections, attention mechanism, and decoding heuristics. As part of this contribution, we also release an open-source NMT framework in TensorFlow to make it easy for others to reproduce our results and perform their own experiments.

pdf bib
Proceedings of the First Workshop on Neural Machine Translation
Thang Luong | Alexandra Birch | Graham Neubig | Andrew Finch
Proceedings of the First Workshop on Neural Machine Translation

2016

pdf
Compression of Neural Machine Translation Models via Pruning
Abigail See | Minh-Thang Luong | Christopher D. Manning
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf
Models and Inference for Prefix-Constrained Machine Translation
Joern Wuebker | Spence Green | John DeNero | Saša Hasan | Minh-Thang Luong
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
Minh-Thang Luong | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

abs
Neural Machine Translation
Thang Luong | Kyunghyun Cho | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Neural Machine Translation (NMT) is a simple new architecture for getting machines to learn to translate. Despite being relatively new (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), NMT has already shown promising results, achieving state-of-the-art performances for various language pairs (Luong et al, 2015a; Jean et al, 2015; Luong et al, 2015b; Sennrich et al., 2016; Luong and Manning, 2016). While many of these NMT papers were presented to the ACL community, research and practice of NMT are only at their beginning stage. This tutorial would be a great opportunity for the whole community of machine translation and natural language processing to learn more about a very promising new approach to MT. This tutorial has four parts.In the first part, we start with an overview of MT approaches, including: (a) traditional methods that have been dominant over the past twenty years and (b) recent hybrid models with the use of neural network components. From these, we motivate why an end-to-end approach like neural machine translation is needed. The second part introduces a basic instance of NMT. We start out with a discussion of recurrent neural networks, including the back-propagation-through-time algorithm and stochastic gradient descent optimizers, as these are the foundation on which NMT builds. We then describe in detail the basic sequence-to-sequence architecture of NMT (Cho et al., 2014; Sutskever et al., 2014), the maximum likelihood training approach, and a simple beam-search decoder to produce translations.The third part of our tutorial describes techniques to build state-of-the-art NMT. We start with approaches to extend the vocabulary coverage of NMT (Luong et al., 2015a; Jean et al., 2015; Chitnis and DeNero, 2015). We then introduce the idea of jointly learning both translations and alignments through an attention mechanism (Bahdanau et al., 2015); other variants of attention (Luong et al., 2015b; Tu et al., 2016) are discussed too. We describe a recent trend in NMT, that is to translate at the sub-word level (Chung et al., 2016; Luong and Manning, 2016; Sennrich et al., 2016), so that language variations can be effectively handled. We then give tips on training and testing NMT systems such as batching and ensembling. In the final part of the tutorial, we briefly describe promising approaches, such as (a) how to combine multiple tasks to help translation (Dong et al., 2015; Luong et al., 2016; Firat et al., 2016; Zoph and Knight, 2016) and (b) how to utilize monolingual corpora (Sennrich et al., 2016). Lastly, we conclude with challenges remained to be solved for future NMT.PS: we would also like to acknowledge the very first paper by Forcada and Ñeco (1997) on sequence-to-sequence models for translation!

2015

pdf
Learning Distributed Representations for Multilingual Text Sequences
Hieu Pham | Thang Luong | Christopher Manning
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

pdf
Bilingual Word Representations with Monolingual Quality in Mind
Thang Luong | Hieu Pham | Christopher D. Manning
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

pdf
Evaluating Models of Computation and Storage in Human Sentence Processing
Thang Luong | Timothy O’Donnell | Noah Goodman
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

pdf
Deep Neural Language Models for Machine Translation
Thang Luong | Michael Kayser | Christopher D. Manning
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf
Stanford neural machine translation systems for spoken language domains
Minh-Thang Luong | Christopher Manning
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
Addressing the Rare Word Problem in Neural Machine Translation
Thang Luong | Ilya Sutskever | Quoc Le | Oriol Vinyals | Wojciech Zaremba
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
A Hierarchical Neural Autoencoder for Paragraphs and Documents
Jiwei Li | Thang Luong | Dan Jurafsky
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
Effective Approaches to Attention-based Neural Machine Translation
Thang Luong | Hieu Pham | Christopher D. Manning
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
When Are Tree Structures Necessary for Deep Learning of Representations?
Jiwei Li | Thang Luong | Dan Jurafsky | Eduard Hovy
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2013

pdf
Better Word Representations with Recursive Neural Networks for Morphology
Thang Luong | Richard Socher | Christopher Manning
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

pdf abs
Parsing entire discourses as very long strings: Capturing topic continuity in grounded language learning
Minh-Thang Luong | Michael C. Frank | Mark Johnson
Transactions of the Association for Computational Linguistics, Volume 1

Grounded language learning, the task of mapping from natural language to a representation of meaning, has attracted more and more interest in recent years. In most work on this topic, however, utterances in a conversation are treated independently and discourse structure information is largely ignored. In the context of language acquisition, this independence assumption discards cues that are important to the learner, e.g., the fact that consecutive utterances are likely to share the same referent (Frank et al., 2013). The current paper describes an approach to the problem of simultaneously modeling grounded language at the sentence and discourse levels. We combine ideas from parsing and grammar induction to produce a parser that can handle long input strings with thousands of tokens, creating parse trees that represent full discourses. By casting grounded language learning as a grammatical inference task, we use our parser to extend the work of Johnson et al. (2012), investigating the importance of discourse continuity in children’s language acquisition and its interaction with social cues. Our model boosts performance in a language acquisition task and yields good discourse segmentations compared with human annotators.