Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora

Ali Hakimi Parizi, Paul Cook


Abstract
In this paper, we propose a novel method for learning cross-lingual word embeddings that incorporates sub-word information during training and is able to learn high-quality embeddings from modest amounts of monolingual data and a bilingual lexicon. This method could be particularly well-suited to lower-resource, morphologically-rich languages, enabling knowledge to be transferred from higher- to lower-resource languages. We evaluate our proposed approach by simulating lower-resource languages on bilingual lexicon induction, monolingual word similarity, and document classification. Our results indicate that incorporating sub-word information indeed leads to improvements and, in the case of document classification, to performance better than, or on par with, strong benchmark approaches.
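The page does not spell out the training procedure, but the sketch below illustrates one common joint-training recipe consistent with the abstract: fastText-style character n-grams supply the sub-word information, and words linked in the bilingual lexicon are occasionally substituted for their translations during training so that both languages land in a single shared embedding space. The toy corpora, lexicon format, and substitution step are illustrative assumptions, not the authors' exact method.

# Hypothetical sketch only; assumes gensim and toy data, not the paper's setup.
import random
from gensim.models import FastText

# Toy monolingual corpora (stand-ins for modest amounts of real data).
sents_en = [["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "chased", "the", "cat"]]
sents_fr = [["le", "chat", "dort", "sur", "le", "tapis"],
            ["le", "chien", "poursuit", "le", "chat"]]

# Toy bilingual lexicon: English word -> French translations (assumed format).
lexicon = {"cat": ["chat"], "dog": ["chien"], "mat": ["tapis"]}

def mixed_corpus(l1_sents, l2_sents, lexicon, swap_prob=0.3):
    """Yield sentences from both languages; in L1 sentences, replace a word
    with a lexicon translation with probability swap_prob, so that
    translation pairs share contexts and end up close in the joint space."""
    for sent in l1_sents:
        yield [random.choice(lexicon[w])
               if w in lexicon and random.random() < swap_prob else w
               for w in sent]
    for sent in l2_sents:
        yield list(sent)

corpus = list(mixed_corpus(sents_en, sents_fr, lexicon))

# Skip-gram fastText: min_n/max_n control the character n-grams that
# provide sub-word information for rare and unseen words.
model = FastText(sentences=corpus, vector_size=100, sg=1,
                 min_n=3, max_n=6, window=5, min_count=1, epochs=20)

# Translation pairs should now be near neighbours in the shared space.
print(model.wv.similarity("cat", "chat"))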
Anthology ID:
2020.starsem-1.5
Volume:
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
*SEM
SIGs:
SIGLEX | SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
39–49
URL:
https://aclanthology.org/2020.starsem-1.5
Cite (ACL):
Ali Hakimi Parizi and Paul Cook. 2020. Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 39–49, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora (Hakimi Parizi & Cook, *SEM 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.starsem-1.5.pdf
Data
MLDoc | RCV1