Synthetic Data for Neural Machine Translation of Spoken-Dialects

Hany Hassan; Mostafa Elaraby; Ahmed Y. Tawfik

Abstract

In this paper, we introduce a novel approach to generating synthetic data for training Neural Machine Translation systems. The proposed approach supports language variants and dialects that have very limited parallel training data. It uses seed data to project words from a closely related, resource-rich language to an under-resourced language variant via word embedding representations. The approach is based on localized embedding projection of distributed representations, which uses monolingual embeddings and approximate nearest-neighbor queries to transform parallel data across language variants. Our approach is language independent and can be used to generate data for any variant of the source language, such as slang or a spoken dialect, or even for a different language that is related to the source language. We report experimental results on Levantine-to-English translation using Neural Machine Translation. We show that the synthetic data improves on a very large-scale system by more than 2.8 BLEU points, and that it can be used to provide a reliable translation system for a spoken dialect that does not have sufficient parallel data.
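
The projection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it fits a single global least-squares map from seed word pairs, whereas the paper describes a localized projection, and it uses exact cosine search as a stand-in for an approximate nearest-neighbor index. All names (`fit_projection`, `nearest_words`, `project_sentence`) and the toy embeddings are hypothetical.

```python
import numpy as np


def fit_projection(src_seed, tgt_seed):
    """Least-squares matrix W such that src_seed @ W approximates tgt_seed."""
    W, *_ = np.linalg.lstsq(src_seed, tgt_seed, rcond=None)
    return W


def nearest_words(queries, vocab_vectors, vocab_words):
    """Exact cosine nearest neighbour (stand-in for an approximate NN index)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    v = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    return [vocab_words[i] for i in (q @ v.T).argmax(axis=1)]


def project_sentence(tokens, src_emb, W, tgt_vectors, tgt_words):
    """Replace each known source token with its nearest target-variant word."""
    out = []
    for tok in tokens:
        if tok in src_emb:
            projected = src_emb[tok][None, :] @ W
            out.append(nearest_words(projected, tgt_vectors, tgt_words)[0])
        else:
            out.append(tok)  # leave out-of-vocabulary tokens untouched
    return out


# Toy data (assumption): the target space is a rotated copy of the source space.
rng = np.random.default_rng(0)
dim, n_words = 4, 6
src_words = [f"s{i}" for i in range(n_words)]
tgt_words = [f"t{i}" for i in range(n_words)]
src_vectors = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_vectors = src_vectors @ rotation
src_emb = dict(zip(src_words, src_vectors))

# Seed data: the first five word pairs; the sixth pair is held out.
W = fit_projection(src_vectors[:5], tgt_vectors[:5])
projected = project_sentence(["s5", "s0", "unk"], src_emb, W, tgt_vectors, tgt_words)
```

In this toy setup the held-out word `s5` is mapped using only the projection learned from the seed pairs. Applying such a mapping to the source side of existing parallel data is one way to picture how synthetic dialect-to-English pairs could be produced.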

Anthology ID: 2017.iwslt-1.12
Volume: Proceedings of the 14th International Conference on Spoken Language Translation
Month: December 14-15
Year: 2017
Address: Tokyo, Japan
Venue: IWSLT
Publisher: International Workshop on Spoken Language Translation
Pages: 82–89
URL: https://aclanthology.org/2017.iwslt-1.12
Cite (ACL): Hany Hassan, Mostafa Elaraby, and Ahmed Y. Tawfik. 2017. Synthetic Data for Neural Machine Translation of Spoken-Dialects. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 82–89, Tokyo, Japan. International Workshop on Spoken Language Translation.
Cite (Informal): Synthetic Data for Neural Machine Translation of Spoken-Dialects (Hassan et al., IWSLT 2017)
PDF: https://preview.aclanthology.org/update-css-js/2017.iwslt-1.12.pdf