A Self-Training Approach for Short Text Clustering

Amir Hadifar, Lucas Sterckx, Thomas Demeester, Chris Develder


Abstract
Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem, and their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update the weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
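To make the self-training idea in the abstract concrete, the sketch below shows a DEC-style clustering loop over pre-computed sentence embeddings: pretrain an autoencoder, initialise centroids with k-means on the latent codes, then repeatedly sharpen the model's own soft cluster assignments and use them as supervision. This is a minimal illustrative sketch, not the authors' released code; the class and function names (Autoencoder, soft_assign, target_distribution) and the architecture sizes are assumptions.

```python
# Minimal self-training clustering sketch (DEC-style), assuming inputs x are
# pre-computed sentence embeddings (e.g. SIF-weighted averages of word vectors).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans


class Autoencoder(nn.Module):
    """Encoder/decoder over fixed-size sentence embeddings (sizes are illustrative)."""
    def __init__(self, dim_in, dim_latent=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 500), nn.ReLU(),
                                     nn.Linear(500, dim_latent))
        self.decoder = nn.Sequential(nn.Linear(dim_latent, 500), nn.ReLU(),
                                     nn.Linear(500, dim_in))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)


def soft_assign(z, centroids, alpha=1.0):
    """Soft assignments q: Student's t kernel between latent points and centroids."""
    dist = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)


def target_distribution(q):
    """Sharpened auxiliary distribution p used as self-training supervision."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)


def self_train(x, n_clusters, epochs=50, lr=1e-3):
    ae = Autoencoder(x.shape[1])
    opt = torch.optim.Adam(ae.parameters(), lr=lr)

    # Stage 1: pretrain the autoencoder on reconstruction only.
    for _ in range(epochs):
        _, recon = ae(x)
        loss = F.mse_loss(recon, x)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Initialise cluster centroids with k-means on the latent codes.
    with torch.no_grad():
        z, _ = ae(x)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(z.numpy())
    centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32,
                             requires_grad=True)
    opt = torch.optim.Adam(list(ae.parameters()) + [centroids], lr=lr)

    # Stage 2: self-training -- the model's own sharpened assignments act as targets,
    # and the KL divergence between q and p updates the encoder and the centroids.
    for _ in range(epochs):
        z, _ = ae(x)
        q = soft_assign(z, centroids)
        p = target_distribution(q).detach()
        loss = F.kl_div(q.log(), p, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Final hard cluster labels.
    with torch.no_grad():
        return soft_assign(ae(x)[0], centroids).argmax(dim=1)
```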
Anthology ID:
W19-4322
Volume:
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Isabelle Augenstein, Spandana Gella, Sebastian Ruder, Katharina Kann, Burcu Can, Johannes Welbl, Alexis Conneau, Xiang Ren, Marek Rei
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
194–199
URL:
https://aclanthology.org/W19-4322
DOI:
10.18653/v1/W19-4322
Cite (ACL):
Amir Hadifar, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. A Self-Training Approach for Short Text Clustering. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 194–199, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A Self-Training Approach for Short Text Clustering (Hadifar et al., RepL4NLP 2019)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/W19-4322.pdf