Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora

Ali Hakimi Parizi, Paul Cook


Abstract
In this paper, we propose a novel method for learning cross-lingual word embeddings that incorporates sub-word information during training and is able to learn high-quality embeddings from modest amounts of monolingual data and a bilingual lexicon. This method could be particularly well-suited to lower-resource, morphologically-rich languages, enabling knowledge to be transferred from higher- to lower-resource languages. We evaluate our proposed approach by simulating lower-resource languages on bilingual lexicon induction, monolingual word similarity, and document classification. Our results indicate that incorporating sub-word information indeed leads to improvements and, in the case of document classification, to performance better than, or on par with, strong benchmark approaches.
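The page does not spell out the training procedure, but the sketch below illustrates one common joint-training recipe consistent with the abstract: fastText-style character n-grams supply the sub-word information, and words linked in the bilingual lexicon are occasionally substituted for their translations during training so that both languages land in a single shared embedding space. The toy corpora, lexicon format, and substitution step are illustrative assumptions, not the authors' exact method.

# Hypothetical sketch only; assumes gensim and toy data, not the paper's setup.
import random
from gensim.models import FastText

# Toy monolingual corpora (stand-ins for modest amounts of real data).
sents_en = [["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "chased", "the", "cat"]]
sents_fr = [["le", "chat", "dort", "sur", "le", "tapis"],
            ["le", "chien", "poursuit", "le", "chat"]]

# Toy bilingual lexicon: English word -> French translations (assumed format).
lexicon = {"cat": ["chat"], "dog": ["chien"], "mat": ["tapis"]}

def mixed_corpus(l1_sents, l2_sents, lexicon, swap_prob=0.3):
    """Yield sentences from both languages; in L1 sentences, replace a word
    with a lexicon translation with probability swap_prob, so that
    translation pairs share contexts and end up close in the joint space."""
    for sent in l1_sents:
        yield [random.choice(lexicon[w])
               if w in lexicon and random.random() < swap_prob else w
               for w in sent]
    for sent in l2_sents:
        yield list(sent)

corpus = list(mixed_corpus(sents_en, sents_fr, lexicon))

# Skip-gram fastText: min_n/max_n control the character n-grams that
# provide sub-word information for rare and unseen words.
model = FastText(sentences=corpus, vector_size=100, sg=1,
                 min_n=3, max_n=6, window=5, min_count=1, epochs=20)

# Translation pairs should now be near neighbours in the shared space.
print(model.wv.similarity("cat", "chat"))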
Anthology ID:
2020.starsem-1.5
Volume:
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
*SEM
SIGs:
SIGLEX | SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
39–49
URL:
https://aclanthology.org/2020.starsem-1.5
Cite (ACL):
Ali Hakimi Parizi and Paul Cook. 2020. Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 39–49, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora (Hakimi Parizi & Cook, *SEM 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.starsem-1.5.pdf
Data
MLDoc | RCV1