Abstract
We propose a simple log-bilinear softmax-based model for vocabulary expansion in machine translation. Our model uses word embeddings trained on large unlabelled monolingual corpora and learns from a fairly small word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language using the trained bilingual embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English–Spanish language pair. On an out-of-domain test set, we obtain a significant improvement of 3.9 BLEU points.
- Anthology ID:
- W17-2617
- Volume:
- Proceedings of the 2nd Workshop on Representation Learning for NLP
- Month:
- August
- Year:
- 2017
- Address:
- Vancouver, Canada
- Editors:
- Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, Scott Yih
- Venue:
- RepL4NLP
- SIG:
- SIGREP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 139–145
- URL:
- https://aclanthology.org/W17-2617
- DOI:
- 10.18653/v1/W17-2617
- Cite (ACL):
- Pranava Swaroop Madhyastha and Cristina España-Bonet. 2017. Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 139–145, Vancouver, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation (Madhyastha & España-Bonet, RepL4NLP 2017)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/W17-2617.pdf
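
To make the abstract's description concrete, the following is a minimal sketch of how a log-bilinear softmax could turn a source-word embedding into a ranked, probabilistic list of target-language candidates. It is an illustration under stated assumptions, not the authors' implementation: the scoring form `s^T W t`, the function name `translation_distribution`, the projection matrix `W`, and the toy two-dimensional embeddings are all hypothetical. In the paper's setting, `W` would be learned from the bilingual dictionary; here it is simply fixed by hand.

```python
import math


def translation_distribution(src_vec, W, tgt_vecs):
    """Return p(t | s) proportional to exp(s^T W t) for each candidate target word.

    src_vec  : source-word embedding (list of floats)
    W        : hypothetical projection matrix, shape len(src_vec) x target dimension
    tgt_vecs : dict mapping target words to their embeddings
    """
    dim_t = len(W[0])
    # Project the source embedding into the target space: u = W^T s
    u = [sum(src_vec[i] * W[i][j] for i in range(len(src_vec))) for j in range(dim_t)]
    # Bilinear score for every candidate target word: u . t
    scores = {w: sum(u[j] * v[j] for j in range(dim_t)) for w, v in tgt_vecs.items()}
    # Softmax over the candidates, shifted by the max score for numerical stability
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}


# Toy data (invented for illustration only)
src = [1.0, 0.0]                      # embedding of an out-of-vocabulary source word
W = [[1.0, 0.0], [0.0, 1.0]]          # identity projection, standing in for a learned one
targets = {"gato": [1.0, 0.0], "perro": [0.0, 1.0], "casa": [-1.0, 0.0]}

probs = translation_distribution(src, W, targets)
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for word, p in ranked:
    print(f"{word}\t{p:.3f}")
```

The ranked list plays the role of the translation options that the abstract says are passed to the phrase-based system; how those options are integrated into the decoder is not covered by this sketch.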