Abstract
Prior works investigating the geometry of pre-trained word embeddings have shown that word embeddings are distributed in a narrow cone, and that by centering and projecting using principal component vectors one can increase the accuracy of a given set of pre-trained word embeddings. However, theoretically, this post-processing step is equivalent to applying a linear autoencoder to minimize the squared L2 reconstruction error. This result contradicts prior work (Mu and Viswanath, 2018) that proposed to remove the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labeled data.
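The post-processing the abstract refers to can be sketched in a few lines: center the embeddings, compute principal directions, and either keep the projection onto the top components (what a squared-L2 linear autoencoder effectively reconstructs) or subtract it (the removal proposed by Mu and Viswanath, 2018). The snippet below is a minimal NumPy illustration under those assumptions; the function name `pca_postprocess` and the random stand-in data are illustrative and not taken from the paper's code.

```python
import numpy as np

def pca_postprocess(E, k=10, remove=False):
    """Center embeddings and use their top-k principal components.

    E      : (vocab_size, dim) matrix of pre-trained word embeddings.
    k      : number of principal components to use.
    remove : if True, subtract the top-k components (Mu & Viswanath, 2018);
             if False, keep the projection onto the top-k subspace, which is
             also the reconstruction an optimal k-dimensional linear
             autoencoder with squared L2 loss produces on the centred data.
    """
    mu = E.mean(axis=0)
    X = E - mu                               # centering
    # Principal directions via SVD of the centred matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k]                               # top-k directions, shape (k, dim)
    proj = X @ V.T @ V                       # projection onto top-k subspace
    return (X - proj) if remove else proj

# Illustrative usage on random data standing in for real embeddings.
E = np.random.randn(1000, 300).astype(np.float32)
E_kept = pca_postprocess(E, k=50, remove=False)    # retain top components
E_removed = pca_postprocess(E, k=2, remove=True)   # removal-style post-processing
```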
- Anthology ID: 2020.coling-main.149
- Volume: Proceedings of the 28th International Conference on Computational Linguistics
- Month: December
- Year: 2020
- Address: Barcelona, Spain (Online)
- Editors: Donia Scott, Nuria Bel, Chengqing Zong
- Venue: COLING
- Publisher: International Committee on Computational Linguistics
- Pages: 1699–1713
- URL: https://aclanthology.org/2020.coling-main.149
- DOI: 10.18653/v1/2020.coling-main.149
- Cite (ACL): Masahiro Kaneko and Danushka Bollegala. 2020. Autoencoding Improves Pre-trained Word Embeddings. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1699–1713, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal): Autoencoding Improves Pre-trained Word Embeddings (Kaneko & Bollegala, COLING 2020)
- PDF: https://preview.aclanthology.org/emnlp22-frontmatter/2020.coling-main.149.pdf