Abstract
Word embeddings are a relatively new addition to the modern NLP researcher’s toolkit. However, unlike other tools, word embeddings are used in a black-box manner, and there are very few studies of their various hyperparameters. One such hyperparameter is the dimension of word embeddings, which is usually decided by a rule of thumb: somewhere in the range 50 to 300. In this paper, we show that the dimension should instead be chosen based on corpus statistics. More specifically, we show that the number of pairwise equidistant words in the corpus vocabulary (as defined by some distance/similarity metric) gives a lower bound on the number of dimensions, and that going below this bound degrades the quality of the learned word embeddings. Through our evaluations on standard word embedding evaluation tasks, we show that dimensions at or above the bound give better results than those below it.
- Anthology ID:
- I17-2006
- Volume:
- Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
- Month:
- November
- Year:
- 2017
- Address:
- Taipei, Taiwan
- Editors:
- Greg Kondrak, Taro Watanabe
- Venue:
- IJCNLP
- Publisher:
- Asian Federation of Natural Language Processing
- Pages:
- 31–36
- URL:
- https://preview.aclanthology.org/remove-affiliations/I17-2006/
- Cite (ACL):
- Kevin Patel and Pushpak Bhattacharyya. 2017. Towards Lower Bounds on Number of Dimensions for Word Embeddings. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 31–36, Taipei, Taiwan. Asian Federation of Natural Language Processing.
- Cite (Informal):
- Towards Lower Bounds on Number of Dimensions for Word Embeddings (Patel & Bhattacharyya, IJCNLP 2017)
- PDF:
- https://preview.aclanthology.org/remove-affiliations/I17-2006.pdf
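The geometric fact behind the abstract's bound can be sketched in a few lines. Under Euclidean distance, the k standard basis vectors of R^k are pairwise equidistant, yet they all lie on the (k-1)-dimensional hyperplane where the coordinates sum to 1, so k mutually equidistant points need at least k-1 dimensions. This is a minimal illustration of that fact, not code from the paper:

```python
import itertools
import math

def equidistant_points(k):
    """Return k pairwise-equidistant points: the standard basis of R^k.
    They all satisfy sum(x) == 1, i.e. they lie in a (k-1)-dimensional
    affine subspace, so k equidistant points need >= k-1 dimensions."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

pts = equidistant_points(5)
# Every pair is at the same distance, sqrt(2).
dists = {round(euclidean(p, q), 9) for p, q in itertools.combinations(pts, 2)}
assert dists == {round(math.sqrt(2), 9)}
# All points lie on the hyperplane sum(x) == 1.
assert all(abs(sum(p) - 1.0) < 1e-12 for p in pts)
```

Under this reading, a vocabulary containing k words that an embedding should keep mutually equidistant forces the embedding dimension to be at least k-1.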