Abstract
In this paper, we propose a novel method for learning cross-lingual word embeddings that incorporates sub-word information during training and is able to learn high-quality embeddings from modest amounts of monolingual data and a bilingual lexicon. This method could be particularly well-suited to learning cross-lingual embeddings for lower-resource, morphologically-rich languages, enabling knowledge to be transferred from higher-resource to lower-resource languages. We evaluate our proposed approach, simulating lower-resource languages, on bilingual lexicon induction, monolingual word similarity, and document classification. Our results indicate that incorporating sub-word information indeed leads to improvements and, in the case of document classification, to performance better than, or on par with, strong benchmark approaches.
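The paper's joint-training method is not reproduced here, but the core idea in the abstract, combining sub-word-aware embeddings with a bilingual lexicon to link two languages trained only on monolingual data, can be illustrated with a minimal sketch. The sketch below is an assumption-laden simplification, not the authors' method: it trains fastText-style embeddings per language with gensim on toy placeholder corpora, then aligns them offline with an orthogonal Procrustes mapping learned from a tiny hypothetical seed lexicon, rather than training the two spaces jointly as the paper does.

```python
# Minimal sketch (NOT the paper's joint-training method): sub-word-aware
# fastText embeddings trained separately per language, then aligned with
# an orthogonal Procrustes mapping learned from a bilingual lexicon.
# Corpora and lexicon below are toy placeholders.
import numpy as np
from gensim.models import FastText

# Toy monolingual corpora (placeholders for real tokenized text).
en_sentences = [["the", "cat", "sat"], ["a", "dog", "ran"]]
de_sentences = [["die", "katze", "sass"], ["ein", "hund", "lief"]]

# Character n-grams (min_n..max_n) give each word a sub-word
# representation, so rare and unseen words still receive vectors.
en_model = FastText(sentences=en_sentences, vector_size=50,
                    min_n=3, max_n=6, min_count=1)
de_model = FastText(sentences=de_sentences, vector_size=50,
                    min_n=3, max_n=6, min_count=1)

# Hypothetical seed lexicon of (source, target) translation pairs.
lexicon = [("cat", "katze"), ("dog", "hund")]

# Solve orthogonal Procrustes: W = argmin ||XW - Y||_F with W^T W = I,
# given by W = U V^T where X^T Y = U S V^T.
X = np.vstack([en_model.wv[src] for src, _ in lexicon])
Y = np.vstack([de_model.wv[tgt] for _, tgt in lexicon])
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Map an English word (even one unseen in training, thanks to its
# sub-word n-grams) into the German embedding space.
mapped = en_model.wv["cats"] @ W
```

The sub-word n-grams are what make this kind of setup attractive for lower-resource, morphologically-rich languages: even a word absent from the small training corpus still receives a vector composed from its character n-grams, and that vector can then be mapped into the other language's space.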
- Anthology ID: 2020.starsem-1.5
- Volume: Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics
- Month: December
- Year: 2020
- Address: Barcelona, Spain (Online)
- Editors: Iryna Gurevych, Marianna Apidianaki, Manaal Faruqui
- Venue: *SEM
- SIG: SIGLEX
- Publisher: Association for Computational Linguistics
- Pages: 39–49
- URL: https://aclanthology.org/2020.starsem-1.5
- Cite (ACL): Ali Hakimi Parizi and Paul Cook. 2020. Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 39–49, Barcelona, Spain (Online). Association for Computational Linguistics.
- Cite (Informal): Joint Training for Learning Cross-lingual Embeddings with Sub-word Information without Parallel Corpora (Hakimi Parizi & Cook, *SEM 2020)
- PDF: https://preview.aclanthology.org/naacl24-info/2020.starsem-1.5.pdf
- Data: MLDoc, RCV1