Rethinking Topic Modelling: From Document-Space to Term-Space

Magnus Sahlgren


Abstract
This paper problematizes the reliance on documents as the basic notion for defining term interactions in standard topic models. As an alternative to this practice, we reformulate topic distributions as latent factors in term similarity space. We exemplify the idea using a number of standard word embeddings built with very wide context windows. The embedding spaces are transformed to sparse similarity spaces, and topics are extracted in standard fashion by factorizing to a lower-dimensional space. We use a number of different factorization techniques, and evaluate the various models using a large set of evaluation metrics, including previously published coherence measures, as well as a number of novel measures that we suggest better correspond to real-world applications of topic models. Our results clearly demonstrate that term-based models outperform standard document-based models by a large margin.
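The pipeline sketched in the abstract — build a term similarity matrix from pretrained embeddings, sparsify it, then factorize it into a lower-dimensional topic space — can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the tiny vocabulary, random embeddings, top-k sparsification, and multiplicative-update NMF are all illustrative assumptions (the paper evaluates several embedding types and factorization techniques).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pretrained word embeddings (vocab_size x dim);
# in the paper these come from embeddings with very wide context windows.
vocab = ["cat", "dog", "pet", "car", "road", "truck"]
E = rng.standard_normal((len(vocab), 50))

# 1. Dense term-term cosine similarity matrix.
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
S = E_norm @ E_norm.T

# 2. Sparsify: keep only the top-k similarities per row, clip negatives
#    so the matrix is nonnegative (one possible sparsification scheme).
k = 3
thresh = np.sort(S, axis=1)[:, -k][:, None]
S_sparse = np.where(S >= thresh, np.clip(S, 0.0, None), 0.0)

# 3. Factorize the sparse similarity matrix into n_topics latent factors,
#    here with simple multiplicative-update NMF: S_sparse ~ W @ H.
n_topics = 2
W = rng.random((len(vocab), n_topics)) + 1e-3
H = rng.random((n_topics, len(vocab))) + 1e-3
for _ in range(200):
    H *= (W.T @ S_sparse) / (W.T @ W @ H + 1e-9)
    W *= (S_sparse @ H.T) / (W @ H @ H.T + 1e-9)

# Each latent factor (topic) is characterized by its highest-loading terms.
for t in range(n_topics):
    top = np.argsort(-H[t])[:3]
    print(f"topic {t}:", [vocab[i] for i in top])
```

The key contrast with document-based topic models is that the factorized matrix is term-by-term similarity rather than term-by-document counts, so topics are latent factors in term-space.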
Anthology ID:
2020.findings-emnlp.204
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2250–2259
URL:
https://aclanthology.org/2020.findings-emnlp.204
DOI:
10.18653/v1/2020.findings-emnlp.204
Cite (ACL):
Magnus Sahlgren. 2020. Rethinking Topic Modelling: From Document-Space to Term-Space. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2250–2259, Online. Association for Computational Linguistics.
Cite (Informal):
Rethinking Topic Modelling: From Document-Space to Term-Space (Sahlgren, Findings 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.findings-emnlp.204.pdf