Synthetic, yet natural: Properties of WordNet random walk corpora and the impact of rare words on embedding performance

Filip Klubička, Alfredo Maldonado, Abhijit Mahalunkar, John Kelleher


Abstract
Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find that the distributions in the pseudo-corpora exhibit properties found in natural corpora, such as Zipf’s and Heaps’ laws, and also observe that the proportion of rare words in a pseudo-corpus affects the performance of its embeddings on word similarity.
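The random-walk idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy hypernym tree stands in for the WordNet taxonomy, the walk is a plain uniform walk over undirected edges, and the names `HYPERNYM_EDGES`, `build_graph`, `random_walk`, `generate_corpus`, and `vocabulary_growth` are hypothetical. The paper's actual walk procedure and parameters are not stated on this page.

```python
import random
from collections import Counter

# Toy hypernym tree (hypothetical; stands in for the WordNet taxonomy).
HYPERNYM_EDGES = [
    ("entity", "animal"), ("entity", "plant"),
    ("animal", "dog"), ("animal", "cat"),
    ("plant", "tree"), ("plant", "flower"),
    ("dog", "poodle"), ("tree", "oak"),
]

def build_graph(edges):
    """Undirected adjacency lists, so a walk can move up or down the tree."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)
        graph.setdefault(child, []).append(parent)
    return graph

def random_walk(graph, start, length, rng):
    """Return one pseudo-sentence of `length` nodes from a uniform random walk."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def generate_corpus(graph, n_sentences, length, seed=0):
    """Generate a pseudo-corpus; each walk starts at a uniformly chosen node."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    return [random_walk(graph, rng.choice(nodes), length, rng)
            for _ in range(n_sentences)]

def vocabulary_growth(corpus):
    """Distinct types seen after each token (the quantity Heaps' law describes)."""
    seen, growth = set(), []
    for sentence in corpus:
        for token in sentence:
            seen.add(token)
            growth.append(len(seen))
    return growth

graph = build_graph(HYPERNYM_EDGES)
corpus = generate_corpus(graph, n_sentences=200, length=5, seed=0)
frequencies = Counter(token for sentence in corpus for token in sentence)
growth = vocabulary_growth(corpus)
```

Sorting `frequencies` by count gives the rank-frequency curve that Zipf's law describes, and `growth` gives the vocabulary-size curve that Heaps' law describes. A graph this small cannot show either law; it only shows where the measurements would come from.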
Anthology ID:
2019.gwc-1.18
Volume:
Proceedings of the 10th Global Wordnet Conference
Month:
July
Year:
2019
Address:
Wroclaw, Poland
Venue:
GWC
Publisher:
Global Wordnet Association
Pages:
140–150
URL:
https://aclanthology.org/2019.gwc-1.18
Cite (ACL):
Filip Klubička, Alfredo Maldonado, Abhijit Mahalunkar, and John Kelleher. 2019. Synthetic, yet natural: Properties of WordNet random walk corpora and the impact of rare words on embedding performance. In Proceedings of the 10th Global Wordnet Conference, pages 140–150, Wroclaw, Poland. Global Wordnet Association.
Cite (Informal):
Synthetic, yet natural: Properties of WordNet random walk corpora and the impact of rare words on embedding performance (Klubička et al., GWC 2019)
PDF:
https://preview.aclanthology.org/author-url/2019.gwc-1.18.pdf