Phillip Keung


Domain Mismatch Doesn’t Always Prevent Cross-lingual Transfer Learning
Daniel Edmiston | Phillip Keung | Noah A. Smith
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Cross-lingual transfer learning without labeled target language data or parallel text has been surprisingly effective in zero-shot cross-lingual classification, question answering, unsupervised machine translation, etc. However, some recent publications have claimed that domain mismatch prevents cross-lingual transfer, and their results show that unsupervised bilingual lexicon induction (UBLI) and unsupervised neural machine translation (UNMT) do not work well when the underlying monolingual corpora come from different domains (e.g., French text from Wikipedia but English text from UN proceedings). In this work, we show how a simple initialization regimen can overcome much of the effect of domain mismatch in cross-lingual transfer. We pre-train word and contextual embeddings on the concatenated domain-mismatched corpora, and use these as initializations for three tasks: MUSE UBLI, UN Parallel UNMT, and the SemEval 2017 cross-lingual word similarity task. In all cases, our results challenge the conclusions of prior work by showing that proper initialization can recover a large portion of the losses incurred by domain mismatch.

The Engage Corpus: A Social Media Dataset for Text-Based Recommender Systems
Daniel Cheng | Kyle Yan | Phillip Keung | Noah A. Smith
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Social media platforms play an increasingly important role as forums for public discourse. Many platforms use recommendation algorithms that funnel users to online groups with the goal of maximizing user engagement, which many commentators have pointed to as a source of polarization and misinformation. Understanding the role of NLP in recommender systems is an interesting research area, given the role that social media has played in world events. However, there are few standardized resources which researchers can use to build models that predict engagement with online groups on social media; each research group constructs datasets from scratch without releasing their version for reuse. In this work, we present a dataset drawn from posts and comments on the online message board Reddit. We develop baseline models for recommending subreddits to users, given the user’s post and comment history. We also study the behavior of our recommender models on subreddits that were banned in June 2020 as part of Reddit’s efforts to stop the dissemination of hate speech.


Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings
Phillip Keung | Julian Salazar | Yichao Lu | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 8

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT’14 French-English and WMT’16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT’15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.

Don’t Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of Contextual Embeddings
Phillip Keung | Yichao Lu | Julian Salazar | Vikas Bhardwaj
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on one source language and evaluated on a different target language. However, published results for mBERT zero-shot accuracy vary as much as 17 points on the MLDoc classification task across four papers. We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results on the MLDoc and XNLI tasks. English dev accuracy is often uncorrelated (or even anti-correlated) with target language accuracy, and zero-shot performance varies greatly at different points in the same fine-tuning run and between different fine-tuning runs. These reproducibility issues are also present for other tasks with different pre-trained embeddings (e.g., MLQA with XLM-R). We recommend providing oracle scores alongside zero-shot results: still fine-tune using English data, but choose a checkpoint with the target dev set. Reporting this upper bound makes results more consistent by avoiding arbitrarily bad checkpoints.

The Multilingual Amazon Reviews Corpus
Phillip Keung | Yichao Lu | György Szarvas | Noah A. Smith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., ‘books’, ‘appliances’, etc.) The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.

Improving Non-autoregressive Neural Machine Translation with Monolingual Data
Jiawei Zhou | Phillip Keung
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Non-autoregressive (NAR) neural machine translation is usually done via knowledge distillation from an autoregressive (AR) model. Under this framework, we leverage large monolingual corpora to improve the NAR model’s performance, with the goal of transferring the AR model’s generalization ability while preventing overfitting. On top of a strong NAR baseline, our experimental results on the WMT14 En-De and WMT16 En-Ro news translation tasks confirm that monolingual data augmentation consistently improves the performance of the NAR model to approach the teacher AR model’s performance, yields comparable or better results than the best non-iterative NAR methods in the literature and helps reduce overfitting in the training process.


Adversarial Learning with Contextual Embeddings for Zero-resource Cross-lingual Classification and NER
Phillip Keung | Yichao Lu | Vikas Bhardwaj
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Contextual word embeddings (e.g. GPT, BERT, ELMo, etc.) have demonstrated state-of-the-art performance on various NLP tasks. Recent work with the multilingual version of BERT has shown that the model performs surprisingly well in cross-lingual settings, even when only labeled English data is used to finetune the model. We improve upon multilingual BERT’s zero-resource cross-lingual performance via adversarial learning. We report the magnitude of the improvement on the multilingual MLDoc text classification and CoNLL 2002/2003 named entity recognition tasks. Furthermore, we show that language-adversarial training encourages BERT to align the embeddings of English documents and their translations, which may be the cause of the observed performance gains.


A neural interlingua for multilingual machine translation
Yichao Lu | Phillip Keung | Faisal Ladhak | Vikas Bhardwaj | Shaonan Zhang | Jason Sun
Proceedings of the Third Conference on Machine Translation: Research Papers

We incorporate an explicit neural interlingua into a multilingual encoder-decoder neural machine translation (NMT) architecture. We demonstrate that our model learns a language-independent representation by performing direct zero-shot translation (without using pivot translation), and by using the source sentence embeddings to create an English Yelp review classifier that, through the mediation of the neural interlingua, can also classify French and German reviews. Furthermore, we show that, despite using a smaller number of parameters than a pairwise collection of bilingual NMT models, our approach produces comparable BLEU scores for each language pair in WMT15.