Cross-Lingual Word Embeddings for Morphologically Rich Languages

Ahmet Üstün, Gosse Bouma, Gertjan van Noord


Abstract
Cross-lingual word embedding models learn a shared vector space for two or more languages so that words with similar meanings are represented by similar vectors regardless of their language. Although existing models achieve high performance on pairs of morphologically simple languages, they perform very poorly on morphologically rich languages such as Turkish and Finnish. In this paper, we propose a morpheme-based model to improve the performance of cross-lingual word embeddings on morphologically rich languages. Our model includes a simple extension that enables us to exploit morphemes for cross-lingual mapping. We applied our model to the Turkish-Finnish language pair on the bilingual word translation task. Results show that our model outperforms the baseline models by 2% in the nearest neighbour ranking.
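The abstract does not spell out how the mapping or the nearest neighbour ranking is computed, so the following is only a minimal sketch of the general idea, assuming a generic orthogonal (Procrustes) mapping between two embedding spaces and cosine-based retrieval. It is not the authors' morpheme-based model; the function names and the random toy vectors are hypothetical and stand in for real Turkish and Finnish embeddings and a seed dictionary.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimising ||src @ W - tgt||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def nearest_neighbour_accuracy(src, tgt, w):
    """Fraction of source rows whose cosine-nearest target row is the gold one.

    Assumes row i of `src` translates to row i of `tgt`.
    """
    mapped = src @ w
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt_unit = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    similarities = mapped @ tgt_unit.T
    predictions = similarities.argmax(axis=1)
    return float((predictions == np.arange(len(src))).mean())


# Toy data (hypothetical): the "source" space is an exact rotation of the
# "target" space, so a perfect orthogonal mapping exists.
rng = np.random.default_rng(0)
tgt_vecs = rng.normal(size=(50, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
src_vecs = tgt_vecs @ true_rotation.T

w = learn_orthogonal_map(src_vecs, tgt_vecs)
accuracy = nearest_neighbour_accuracy(src_vecs, tgt_vecs, w)
```

With real embeddings the two spaces are only approximately related, so the retrieval accuracy is what a bilingual word translation evaluation of this kind would report.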
Anthology ID:
R19-1140
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1222–1228
URL:
https://aclanthology.org/R19-1140
DOI:
10.26615/978-954-452-056-4_140
Cite (ACL):
Ahmet Üstün, Gosse Bouma, and Gertjan van Noord. 2019. Cross-Lingual Word Embeddings for Morphologically Rich Languages. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1222–1228, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Cross-Lingual Word Embeddings for Morphologically Rich Languages (Üstün et al., RANLP 2019)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/R19-1140.pdf