Abstract
Most previous work in unsupervised semantic modeling in the presence of metadata has assumed that our goal is to make latent dimensions more correlated with metadata, but in practice the exact opposite is often true. Some users want topic models that highlight differences between, for example, authors, but others seek more subtle connections across authors. We introduce three metrics for identifying topics that are highly correlated with metadata, and demonstrate that this problem affects between 30 and 50% of the topics in models trained on two real-world collections, regardless of the size of the model. We find that we can predict which words cause this phenomenon and that by selectively subsampling these words we dramatically reduce topic-metadata correlation, improve topic stability, and maintain or even improve model quality.
- Anthology ID:
- C18-1329
- Volume:
- Proceedings of the 27th International Conference on Computational Linguistics
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Emily M. Bender, Leon Derczynski, Pierre Isabelle
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3903–3914
- URL:
- https://aclanthology.org/C18-1329
- Cite (ACL):
- Laure Thompson and David Mimno. 2018. Authorless Topic Models: Biasing Models Away from Known Structure. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3903–3914, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- Authorless Topic Models: Biasing Models Away from Known Structure (Thompson & Mimno, COLING 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/C18-1329.pdf
- Code
- laurejt/authorless-tms
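
The selective-subsampling idea from the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the entropy score, the `threshold`, and the `keep_prob` parameter are assumptions chosen for clarity (the paper proposes its own metrics; see the linked repository for the authors' implementation).

```python
import math
import random
from collections import Counter, defaultdict

def author_entropy(word_author_counts):
    """Normalized entropy of a word's counts across authors.

    0.0 means the word occurs under a single author (highly
    metadata-correlated); 1.0 means it is spread uniformly.
    """
    total = sum(word_author_counts.values())
    if total == 0 or len(word_author_counts) < 2:
        return 0.0
    probs = [c / total for c in word_author_counts.values() if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(word_author_counts))

def subsample_corpus(docs, authors, threshold=0.5, keep_prob=0.1, seed=0):
    """Randomly drop tokens of words concentrated in few authors.

    docs:    list of tokenized documents (lists of words)
    authors: one author label per document
    Words whose author entropy falls below `threshold` are kept
    only with probability `keep_prob`, biasing a downstream topic
    model away from author structure.
    """
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for doc, author in zip(docs, authors):
        for w in doc:
            counts[w][author] += 1
    skewed = {w for w, by_author in counts.items()
              if author_entropy(by_author) < threshold}
    return [[w for w in doc if w not in skewed or rng.random() < keep_prob]
            for doc in docs]
```

The resulting corpus can be fed to any standard topic model; only the preprocessing changes, which is what makes this kind of intervention cheap to try.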