Unsupervised Multilingual Word Embedding with Limited Resources using Neural Language Models

Takashi Wada, Tomoharu Iwata, Yuji Matsumoto


Abstract
Recently, a variety of unsupervised methods have been proposed that map pre-trained word embeddings of different languages into the same space without any parallel data. These methods aim to find a linear transformation based on the assumption that monolingual word embeddings are approximately isomorphic between languages. However, it has been demonstrated that this assumption holds true only on specific conditions, and with limited resources, the performance of these methods decreases drastically. To overcome this problem, we propose a new unsupervised multilingual embedding method that does not rely on such assumption and performs well under resource-poor scenarios, namely when only a small amount of monolingual data (i.e., 50k sentences) are available, or when the domains of monolingual data are different across languages. Our proposed model, which we call ‘Multilingual Neural Language Models’, shares some of the network parameters among multiple languages, and encodes sentences of multiple languages into the same space. The model jointly learns word embeddings of different languages in the same space, and generates multilingual embeddings without any parallel data or pre-training. Our experiments on word alignment tasks have demonstrated that, on the low-resource condition, our model substantially outperforms existing unsupervised and even supervised methods trained with 500 bilingual pairs of words. Our model also outperforms unsupervised methods given different-domain corpora across languages. Our code is publicly available.
Anthology ID:
P19-1300
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3113–3124
Language:
URL:
https://aclanthology.org/P19-1300
DOI:
10.18653/v1/P19-1300
Bibkey:
Cite (ACL):
Takashi Wada, Tomoharu Iwata, and Yuji Matsumoto. 2019. Unsupervised Multilingual Word Embedding with Limited Resources using Neural Language Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3113–3124, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Multilingual Word Embedding with Limited Resources using Neural Language Models (Wada et al., ACL 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-dup-bibkey/P19-1300.pdf
Code
 twadada/multilingual-nlm