Evaluating Sub-word Embeddings in Cross-lingual Models

Ali Hakimi Parizi, Paul Cook


Abstract
Cross-lingual word embeddings create a shared space for embeddings in two languages, and enable knowledge to be transferred between languages for tasks such as bilingual lexicon induction. One problem, however, is out-of-vocabulary (OOV) words, for which no embeddings are available. This is particularly problematic for low-resource and morphologically-rich languages, which often have relatively high OOV rates. Approaches to learning sub-word embeddings have been proposed to address the problem of OOV words, but most prior work has not considered sub-word embeddings in cross-lingual models. In this paper, we consider whether sub-word embeddings can be leveraged to form cross-lingual embeddings for OOV words. Specifically, we consider a novel bilingual lexicon induction task focused on OOV words, for language pairs covering several language families. Our results indicate that cross-lingual representations for OOV words can indeed be formed from sub-word embeddings, including in the case of a truly low-resource morphologically-rich language.
Anthology ID: 2020.lrec-1.330
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 2712–2719
Language: English
URL: https://aclanthology.org/2020.lrec-1.330
Cite (ACL): Ali Hakimi Parizi and Paul Cook. 2020. Evaluating Sub-word Embeddings in Cross-lingual Models. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2712–2719, Marseille, France. European Language Resources Association.
Cite (Informal): Evaluating Sub-word Embeddings in Cross-lingual Models (Hakimi Parizi & Cook, LREC 2020)
PDF: https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.330.pdf