Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder

Xiaobao Wu, Chunping Li, Yan Zhu, Yishu Miao


Abstract
Topic models have prevailed for many years as tools for discovering latent semantics in long documents. For short texts, however, they generally suffer from data sparsity caused by extremely limited word co-occurrences, and thus tend to yield repetitive or trivial topics of low quality. In this paper, to address this issue, we propose a novel neural topic model in the autoencoding framework with a new topic distribution quantization approach that generates peakier distributions better suited to modeling short texts. Beyond the encoding side, we further tackle this issue in the decoding by proposing a novel negative sampling decoder that learns from negative samples to avoid yielding repetitive topics. We observe that our model substantially improves short text topic modeling performance. Through extensive experiments on real-world datasets, we demonstrate that our model outperforms both strong traditional and neural baselines under extreme data sparsity, producing high-quality topics.
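The two ideas in the abstract can be illustrated with a minimal sketch. Note this is not the authors' actual method (the paper's quantization and decoder are defined in the full text): here a simple temperature-sharpened softmax stands in for the quantization that "generates peakier distributions," and a toy contrastive reconstruction term stands in for the negative sampling decoder. All function names below are hypothetical.

```python
import numpy as np

def sharpen(theta, temperature=0.5):
    """Make a topic distribution peakier; lower temperature -> peakier.
    Illustrative stand-in for the paper's topic distribution quantization."""
    logits = np.log(theta + 1e-12) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def negative_sampling_loss(recon, pos_bow, neg_bow):
    """Toy decoder objective: reward reconstructing the document's own
    words (positive bag-of-words) and penalize fitting a negative sample,
    mimicking how learning from negatives discourages repetitive topics."""
    eps = 1e-12
    pos_ll = np.sum(pos_bow * np.log(recon + eps))
    neg_ll = np.sum(neg_bow * np.log(recon + eps))
    return -(pos_ll - neg_ll)

# A flat 4-topic distribution becomes peakier after sharpening.
theta = np.array([0.4, 0.3, 0.2, 0.1])
peaky = sharpen(theta, temperature=0.5)
```

With `temperature=0.5` the sharpened distribution concentrates more mass on the dominant topic while still summing to one, which is the property the abstract highlights as appropriate for short texts.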
Anthology ID:
2020.emnlp-main.138
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1772–1782
URL:
https://aclanthology.org/2020.emnlp-main.138
DOI:
10.18653/v1/2020.emnlp-main.138
Bibkey:
Cite (ACL):
Xiaobao Wu, Chunping Li, Yan Zhu, and Yishu Miao. 2020. Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1772–1782, Online. Association for Computational Linguistics.
Cite (Informal):
Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder (Wu et al., EMNLP 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.emnlp-main.138.pdf
Code:
bobxwu/NQTM