James Cross


2021

pdf bib
Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders
Xiang Kong | Adithya Renduchintala | James Cross | Yuqing Tang | Jiatao Gu | Xian Li
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Recent work in multilingual translation advances translation quality surpassing bilingual baselines using deep transformer models with increased capacity. However, the extra latency and memory costs introduced by this approach may make it unacceptable for efficiency-constrained applications. It has recently been shown for bilingual translation that using a deep encoder and shallow decoder (DESD) can reduce inference latency while maintaining translation quality, so we study similar speed-accuracy trade-offs for multilingual translation. We find that for many-to-one translation we can indeed increase decoder speed without sacrificing quality using this approach, but for one-to-many translation, shallow decoders cause a clear quality drop. To ameliorate this drop, we propose a deep encoder with multiple shallow decoders (DEMSD) where each shallow decoder is responsible for a disjoint subset of target languages. Specifically, the DEMSD model with 2-layer decoders is able to obtain a 1.8x speedup on average compared to a standard transformer model with no drop in translation quality.

pdf bib
Improving Zero-Shot Translation by Disentangling Positional Information
Danni Liu | Jan Niehues | James Cross | Francisco Guzmán | Xian Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations.

pdf bib
Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications
Shuo Sun | Ahmed El-Kishky | Vishrav Chaudhary | James Cross | Lucia Specia | Francisco Guzmán
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence-level Quality estimation (QE) of machine translation is traditionally formulated as a regression task, and the performance of QE models is typically measured by Pearson correlation with human labels. Recent QE models have achieved previously-unseen levels of correlation with human judgments, but they rely on large multilingual contextualized language models that are computationally expensive and make them infeasible for real-world applications. In this work, we evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting. We observe that a full model parameterization is required to achieve SoTA results in a regression task. However, we argue that the level of expressiveness of a model in a continuous range is unnecessary given the downstream applications of QE, and show that reframing QE as a classification problem and evaluating QE models using classification metrics would better reflect their actual performance in real-world applications.

pdf bib
XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment
Ahmed El-Kishky | Adithya Renduchintala | James Cross | Francisco Guzmán | Philipp Koehn
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Cross-lingual named-entity lexica are an important resource to multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align), a technique to automatically mine cross-lingual entity lexica from mined web data. We demonstrate LSP-Align outperforms baselines at extracting cross-lingual entity pairs and mine 164 million entity pairs from 120 different languages aligned with English. We release these cross-lingual entity pairs along with the massively multilingual tagged named entity corpus as a resource to the NLP community.

pdf bib
Facebook AI’s WMT21 News Translation Task Submission
Chau Tran | Shruti Bhosale | James Cross | Philipp Koehn | Sergey Edunov | Angela Fan
Proceedings of the Sixth Conference on Machine Translation

We describe Facebook’s multilingual model submission to the WMT2021 shared task on news translation. We participate in 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese. To develop systems covering all these directions, we focus on multilingual models. We utilize data from all available sources — WMT, large-scale data mining, and in-domain backtranslation — to create high quality bilingual and multilingual baselines. Subsequently, we investigate strategies for scaling multilingual model size, such that one system has sufficient capacity for high quality representations of all eight languages. Our final submission is an ensemble of dense and sparse Mixture-of-Expert multilingual translation models, followed by finetuning on in-domain news data and noisy channel reranking. Compared to previous year’s winning submissions, our multilingual system improved the translation quality on all language directions, with an average improvement of 2.0 BLEU. In the WMT2021 task, our system ranks first in 10 directions based on automatic evaluation.

2020

bib
A Survey of Qualitative Error Analysis for Neural Machine Translation Systems
Denise Díaz | James Cross | Vishrav Chaudhary | Ahmed Kishky | Philipp Koehn
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

pdf bib
Proceedings of the First Workshop on Automatic Simultaneous Translation
Hua Wu | Collin Cherry | Liang Huang | Zhongjun He | Mark Liberman | James Cross | Yang Liu
Proceedings of the First Workshop on Automatic Simultaneous Translation

2018

pdf bib
Simple Fusion: Return of the Language Model
Felix Stahlberg | James Cross | Veselin Stoyanov
Proceedings of the Third Conference on Machine Translation: Research Papers

Neural Machine Translation (NMT) typically leverages monolingual data in training through backtranslation. We investigate an alternative simple method to use monolingual data for NMT training: We combine the scores of a pre-trained and fixed language model (LM) with the scores of a translation model (TM) while the TM is trained from scratch. To achieve that, we train the translation model to predict the residual probability of the training data added to the prediction of the LM. This enables the TM to focus its capacity on modeling the source sentence since it can rely on the LM for fluency. We show that our method outperforms previous approaches to integrate LMs into NMT while the architecture is simpler as it does not require gating networks to balance TM and LM. We observe gains of between +0.24 and +2.36 BLEU on all four test sets (English-Turkish, Turkish-English, Estonian-English, Xhosa-English) on top of ensembles without LM. We compare our method with alternative ways to utilize monolingual data such as backtranslation, shallow fusion, and cold fusion.

2016

pdf bib
Span-Based Constituency Parsing with a Structure-Label System and Provably Optimal Dynamic Oracles
James Cross | Liang Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Incremental Parsing with Minimal Features Using Bi-Directional LSTM
James Cross | Liang Huang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
Optimal Incremental Parsing via Best-First Dynamic Programming
Kai Zhao | James Cross | Liang Huang
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing