Multilingual Culture-Independent Word Analogy Datasets

Matej Ulčar; Kristiina Vaik; Jessica Lindström; Milda Dailidėnaitė; Marko Robnik-Šikonja

Abstract

In text processing, deep neural networks mostly use word embeddings as input. Embeddings must ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, we typically use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We designed the monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
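As a rough illustration of what a word analogy benchmark measures, the sketch below answers "a is to b as c is to ?" with the widely used vector-offset method (the word closest in cosine similarity to b - a + c). This is an assumption for illustration only: the abstract does not state which scoring method the authors used, and the names `solve_analogy`, `accuracy`, and the two-dimensional `toy` vectors are hypothetical, not taken from the datasets or from fastText.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(emb, a, b, c):
    """Return the word whose vector is closest to emb[b] - emb[a] + emb[c].

    The three query words are excluded from the candidates, as is common
    in analogy evaluation.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def accuracy(emb, quadruples):
    """Fraction of (a, b, c, d) analogies for which the prediction equals d."""
    hits = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in quadruples)
    return hits / len(quadruples)


# Hypothetical toy embeddings, invented for this example only.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

# Hypothetical analogy entries in (a, b, c, expected d) form.
quadruples = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

print(solve_analogy(toy, "man", "woman", "king"))
print(accuracy(toy, quadruples))
```

With real embeddings, `emb` would map each vocabulary word to its pretrained vector, and the reported score would be the accuracy over all analogy entries in a dataset.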

Anthology ID: 2020.lrec-1.501
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 4074–4080
Language: English
URL: https://aclanthology.org/2020.lrec-1.501
Cite (ACL): Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja. 2020. Multilingual Culture-Independent Word Analogy Datasets. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4074–4080, Marseille, France. European Language Resources Association.
Cite (Informal): Multilingual Culture-Independent Word Analogy Datasets (Ulčar et al., LREC 2020)
PDF: https://preview.aclanthology.org/nodalida-main-page/2020.lrec-1.501.pdf
