Multilingual Culture-Independent Word Analogy Datasets
Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja
Abstract
In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We designed the monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
- Anthology ID:
- 2020.lrec-1.501
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4074–4080
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.501
- Cite (ACL):
- Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja. 2020. Multilingual Culture-Independent Word Analogy Datasets. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4074–4080, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Multilingual Culture-Independent Word Analogy Datasets (Ulčar et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2020.lrec-1.501.pdf