Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder

Yunsu Kim, Jiahui Geng, Hermann Ney


Abstract
Unsupervised learning of cross-lingual word embeddings offers an elegant way to match words across languages, but has fundamental limitations in translating sentences. In this paper, we propose simple yet effective methods to improve word-by-word translation with cross-lingual embeddings, using only monolingual corpora and no back-translation. We integrate a language model for context-aware search, and use a novel denoising autoencoder to handle reordering. Our system surpasses state-of-the-art unsupervised translation systems without costly iterative training. We also analyze the effect of vocabulary size and denoising type on translation performance, which provides a better understanding of how cross-lingual word embeddings are learned and used in translation.
Anthology ID:
D18-1101
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
862–868
URL:
https://aclanthology.org/D18-1101
DOI:
10.18653/v1/D18-1101
Cite (ACL):
Yunsu Kim, Jiahui Geng, and Hermann Ney. 2018. Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 862–868, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder (Kim et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/D18-1101.pdf
Video:
https://vimeo.com/305206383