Predicting Word Embeddings Variability

Bénédicte Pierrejean, Ludovic Tanguy


Abstract
Neural word embedding models (such as those built with word2vec) are known to have stability problems: when retraining a model with the exact same hyperparameters, word neighborhoods may change. We propose a method to estimate such variation, based on the overlap between the nearest neighbors of a given word in two models trained with identical hyperparameters. We show that this inherent variation is not negligible, and that it does not affect every word in the same way. We examine the influence of several features that are intrinsic to a word, corpus or embedding model and provide a methodology that can predict the variability (and as such, the reliability) of a word representation in a semantic vector space.
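The variation measure described in the abstract can be illustrated with a minimal sketch, assuming gensim's word2vec implementation (gensim ≥ 4): two models are trained on the same corpus with identical hyperparameters, and the overlap between a word's top-n nearest neighbors in the two models is computed. The neighborhood size n=25, the hyperparameter values, and the helper names (`train`, `neighbor_overlap`, `corpus`) are illustrative assumptions, not the paper's exact settings.

```python
from gensim.models import Word2Vec

def neighbor_overlap(model_a, model_b, word, n=25):
    """Share (in [0, 1]) of the word's top-n nearest neighbors common to both models."""
    neighbors_a = {w for w, _ in model_a.wv.most_similar(word, topn=n)}
    neighbors_b = {w for w, _ in model_b.wv.most_similar(word, topn=n)}
    return len(neighbors_a & neighbors_b) / n

def train(sentences):
    # Identical hyperparameters on every call (illustrative values, not the
    # paper's); only the randomness of training (initialisation, negative
    # sampling, multi-threaded scheduling) differs between the two runs.
    return Word2Vec(sentences, vector_size=100, window=5, min_count=5,
                    sg=1, workers=4)

# `corpus` stands for a list of tokenized sentences (hypothetical placeholder).
# model_a = train(corpus)
# model_b = train(corpus)
# print(neighbor_overlap(model_a, model_b, "house"))
```

A low overlap score flags a word whose neighborhood is unstable across retraining runs, i.e. a word whose representation should be treated with caution.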
Anthology ID:
S18-2019
Volume:
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Venue:
SemEval
SIGs:
SIGLEX | SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
154–159
URL:
https://aclanthology.org/S18-2019
DOI:
10.18653/v1/S18-2019
Cite (ACL):
Bénédicte Pierrejean and Ludovic Tanguy. 2018. Predicting Word Embeddings Variability. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 154–159, New Orleans, Louisiana. Association for Computational Linguistics.
Cite (Informal):
Predicting Word Embeddings Variability (Pierrejean & Tanguy, SemEval 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/S18-2019.pdf