Multilingual Culture-Independent Word Analogy Datasets

Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja


Abstract
In text processing, deep neural networks mostly use word embeddings as input. Embeddings must ensure that relations between words are reflected in distances in a high-dimensional numeric space. To compare the quality of different text embeddings, we typically use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We designed the monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
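
To illustrate how a word analogy benchmark scores an embedding, the following is a minimal sketch and not the authors' evaluation code. It assumes the common vector-offset formulation ("a is to b as c is to d", solved as the nearest neighbour of b - a + c by cosine similarity); the abstract does not state which formulation was used. The names `solve_analogy`, `analogy_accuracy`, and `TOY_EMB` are hypothetical, and the vectors are hand-made toy values, not fastText embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Assumption: vector-offset method, i.e. the candidate whose vector is
    closest (by cosine) to emb[b] - emb[a] + emb[c]. The three query
    words are excluded from the candidates, as is customary.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, quadruples):
    """Fraction of (a, b, c, d) quadruples whose predicted word equals d."""
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in quadruples)
    return correct / len(quadruples)


# Hypothetical toy embeddings, chosen by hand for illustration only.
TOY_EMB = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}

# Hypothetical benchmark entries in (a, b, c, d) form.
TOY_QUADRUPLES = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_EMB, "man", "woman", "king"))
    print(analogy_accuracy(TOY_EMB, TOY_QUADRUPLES))
```

With real embeddings, `TOY_EMB` would be replaced by a word-to-vector mapping loaded from a pretrained model, and the quadruples by the entries of one of the analogy datasets. For a cross-lingual dataset, the word pairs would come from two languages whose embeddings share a common vector space.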
Anthology ID:
2020.lrec-1.501
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4074–4080
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.501
Cite (ACL):
Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja. 2020. Multilingual Culture-Independent Word Analogy Datasets. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4074–4080, Marseille, France. European Language Resources Association.
Cite (Informal):
Multilingual Culture-Independent Word Analogy Datasets (Ulčar et al., LREC 2020)
PDF:
https://preview.aclanthology.org/nodalida-main-page/2020.lrec-1.501.pdf