Entropy-Based Subword Mining with an Application to Word Embeddings
Ahmed El-Kishky, Frank Xu, Aston Zhang, Stephen Macke, Jiawei Han
Abstract
Recent literature has shown a wide variety of benefits to mapping traditional one-hot representations of words and phrases to lower-dimensional real-valued vectors known as word embeddings. Traditionally, most word embedding algorithms treat each word as the finest meaningful semantic granularity and learn a distinct embedding vector for each word. Contrary to this line of thought, technical domains such as scientific and medical literature compose words from subword structures such as prefixes, suffixes, and root-words, as well as compound words. Treating individual words as the finest-granularity unit discards meaningful semantic structure shared between words with common substructures. This not only leads to poor embeddings for text corpora that have long-tail distributions, but also forces reliance on heuristic methods for handling out-of-vocabulary words. In this paper, we propose SubwordMine, an entropy-based subword mining algorithm that is fast, unsupervised, and fully data-driven. We show that this allows for strong cross-domain performance in identifying semantically meaningful subwords. We then investigate utilizing the mined subwords within the FastText embedding model and compare the performance of the learned representations in a downstream language modeling task.
- Anthology ID:
- W18-1202
- Volume:
- Proceedings of the Second Workshop on Subword/Character LEvel Models
- Month:
- June
- Year:
- 2018
- Address:
- New Orleans
- Editors:
- Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh
- Venue:
- SCLeM
- Publisher:
- Association for Computational Linguistics
- Pages:
- 12–21
- URL:
- https://aclanthology.org/W18-1202
- DOI:
- 10.18653/v1/W18-1202
- Cite (ACL):
- Ahmed El-Kishky, Frank Xu, Aston Zhang, Stephen Macke, and Jiawei Han. 2018. Entropy-Based Subword Mining with an Application to Word Embeddings. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 12–21, New Orleans. Association for Computational Linguistics.
- Cite (Informal):
- Entropy-Based Subword Mining with an Application to Word Embeddings (El-Kishky et al., SCLeM 2018)
- PDF:
- https://preview.aclanthology.org/landing_page/W18-1202.pdf