Evaluating Word Embeddings on Low-Resource Languages

Nathan Stringham, Mike Izbicki


Abstract
The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and for word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a dataset of 41 million tokens, and the smallest (Old Gujarati) has a dataset of only 1,813 tokens.
Anthology ID:
2020.eval4nlp-1.17
Volume:
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | Eval4NLP
Publisher:
Association for Computational Linguistics
Pages:
176–186
URL:
https://aclanthology.org/2020.eval4nlp-1.17
DOI:
10.18653/v1/2020.eval4nlp-1.17
Cite (ACL):
Nathan Stringham and Mike Izbicki. 2020. Evaluating Word Embeddings on Low-Resource Languages. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 176–186, Online. Association for Computational Linguistics.
Cite (Informal):
Evaluating Word Embeddings on Low-Resource Languages (Stringham & Izbicki, Eval4NLP 2020)
PDF:
https://preview.aclanthology.org/update-css-js/2020.eval4nlp-1.17.pdf
Video:
https://slideslive.com/38939712