@inproceedings{di-gangi-federico-2017-monolingual,
  title     = {Monolingual Embeddings for Low Resourced Neural Machine Translation},
  author    = {Di Gangi, Mattia Antonino and
               Federico, Marcello},
  editor    = {Sakti, Sakriani and
               Utiyama, Masao},
  booktitle = {Proceedings of the 14th International Conference on Spoken Language Translation},
  month     = dec # " 14--15",
  year      = {2017},
  address   = {Tokyo, Japan},
  publisher = {International Workshop on Spoken Language Translation},
  url       = {https://preview.aclanthology.org/jlcl-multiple-ingestion/2017.iwslt-1.14/},
  pages     = {97--104},
  abstract  = {Neural machine translation (NMT) is the state of the art for machine translation, and it shows the best performance when there is a considerable amount of data available. When only little data exist for a language pair, the model cannot produce good representations for words, particularly for rare words. One common solution consists in reducing data sparsity by segmenting words into sub-words, in order to allow rare words to have shared representations with other words. Taking a different approach, in this paper we present a method to feed an NMT network with word embeddings trained on monolingual data, which are combined with the task-specific embeddings learned at training time. This method can leverage an embedding matrix with a huge number of words, which can therefore extend the word-level vocabulary. Our experiments on two language pairs show good results for the typical low-resourced data scenario (IWSLT in-domain dataset). Our consistent improvements over the baselines represent a positive proof about the possibility to leverage models pre-trained on monolingual data in NMT.},
}
Markdown (Informal)
[Monolingual Embeddings for Low Resourced Neural Machine Translation](https://preview.aclanthology.org/jlcl-multiple-ingestion/2017.iwslt-1.14/) (Di Gangi & Federico, IWSLT 2017)
ACL