Abstract
The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a 41-million-token dataset, and the smallest (Old Gujarati) has a dataset of only 1,813 tokens.
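The abstract names the OddOneOut task but does not spell out its scoring rule. Below is a minimal sketch of how an odd-one-out style evaluation over word embeddings might be scored: the word with the lowest mean cosine similarity to the rest of the set is picked as the intruder. The function names, the category/intruder sampling scheme, and the toy vectors are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an odd-one-out style evaluation for word embeddings.
# Assumptions: categories are tuples of related words, intruders are sampled
# from other categories, and the "odd one" is the word least similar (by mean
# cosine similarity) to the others in the set.
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def odd_one_out(words, embeddings):
    """Return the word whose mean similarity to the other words is lowest."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return np.mean([cosine(embeddings[w], embeddings[o]) for o in others])
    return min(words, key=mean_sim)


def oddoneout_accuracy(categories, embeddings, n_trials=10, seed=0):
    """For each category, add a random intruder from another category and
    check whether the intruder is identified as the odd one out."""
    rng = np.random.default_rng(seed)
    vocab = [w for cat in categories for w in cat]
    correct, total = 0, 0
    for cat in categories:
        outside = [w for w in vocab if w not in cat]
        for _ in range(n_trials):
            intruder = rng.choice(outside)
            guess = odd_one_out(list(cat) + [intruder], embeddings)
            correct += int(guess == intruder)
            total += 1
    return correct / total


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; a real run would load trained vectors.
    emb = {
        "cat": np.array([0.9, 0.1, 0.0]),
        "dog": np.array([0.8, 0.2, 0.1]),
        "horse": np.array([0.7, 0.3, 0.0]),
        "red": np.array([0.0, 0.9, 0.4]),
        "blue": np.array([0.1, 0.8, 0.5]),
        "green": np.array([0.0, 0.7, 0.6]),
    }
    categories = [("cat", "dog", "horse"), ("red", "blue", "green")]
    print(f"OddOneOut accuracy: {oddoneout_accuracy(categories, emb):.2f}")
```

Because the evaluation only needs small sets of semantically related words rather than four-way analogies, a setup like this can be run even when the training corpus is only a few thousand tokens.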
- Anthology ID:
- 2020.eval4nlp-1.17
- Volume:
- Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, Eduard Hovy
- Venue:
- Eval4NLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 176–186
- URL:
- https://aclanthology.org/2020.eval4nlp-1.17
- DOI:
- 10.18653/v1/2020.eval4nlp-1.17
- Cite (ACL):
- Nathan Stringham and Mike Izbicki. 2020. Evaluating Word Embeddings on Low-Resource Languages. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 176–186, Online. Association for Computational Linguistics.
- Cite (Informal):
- Evaluating Word Embeddings on Low-Resource Languages (Stringham & Izbicki, Eval4NLP 2020)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2020.eval4nlp-1.17.pdf