Abstract
Existing supervised models for text clustering struggle to optimize directly for clustering results, because clustering is a discrete process and no meaningful gradient of a discrete function is available to drive gradient-based optimization. Existing supervised clustering algorithms therefore optimize a continuous surrogate that approximates the clustering process. We propose a scalable training strategy that directly optimizes a discrete clustering metric. We train a BERT-based embedding model with our method and evaluate it on two publicly available datasets, showing that it outperforms a BERT-based embedding model trained with triplet loss as well as unsupervised baselines. This suggests that optimizing directly for the clustering outcome indeed yields better representations for clustering.

- Anthology ID:
- 2021.repl4nlp-1.15
- Volume:
- Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Anna Rogers, Iacer Calixto, Ivan Vulić, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, Vered Shwartz
- Venue:
- RepL4NLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 141–151
- URL:
- https://aclanthology.org/2021.repl4nlp-1.15
- DOI:
- 10.18653/v1/2021.repl4nlp-1.15
- Cite (ACL):
- Sumanta Kashyapi and Laura Dietz. 2021. Learn The Big Picture: Representation Learning for Clustering. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 141–151, Online. Association for Computational Linguistics.
- Cite (Informal):
- Learn The Big Picture: Representation Learning for Clustering (Kashyapi & Dietz, RepL4NLP 2021)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2021.repl4nlp-1.15.pdf
- Code:
- nihilistsumo/blackbox_clustering
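To illustrate the abstract's central point, the clustering pipeline (embed, cluster, score) is discrete, so its gradient cannot be obtained by backpropagation alone. The sketch below is a minimal, hypothetical illustration of one generic workaround, a simultaneous-perturbation (SPSA-style) gradient estimate of a discrete clustering metric with respect to the embeddings; it is not the paper's actual training strategy, nor the implementation in the linked repository. The tiny k-means, the pairwise Rand-index metric, and all function names are assumptions made for this sketch.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Tiny Lloyd-style k-means: the discrete, non-differentiable
    # clustering step in the pipeline (illustrative only).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def rand_index(pred, gold):
    # Discrete clustering metric: fraction of point pairs on which the
    # predicted and gold clusterings agree (same/different cluster).
    same_p = pred[:, None] == pred[None, :]
    same_g = gold[:, None] == gold[None, :]
    iu = np.triu_indices(len(pred), k=1)
    return (same_p == same_g)[iu].mean()

def spsa_grad(X, gold, k, eps=1e-2, seed=0):
    # Simultaneous-perturbation estimate of d(metric)/d(embeddings):
    # evaluate the black-box cluster-then-score pipeline at X + eps*delta
    # and X - eps*delta and take a two-point difference quotient.
    rng = np.random.default_rng(seed)
    delta = rng.choice([-1.0, 1.0], size=X.shape)
    f_plus = rand_index(kmeans(X + eps * delta, k), gold)
    f_minus = rand_index(kmeans(X - eps * delta, k), gold)
    return (f_plus - f_minus) / (2 * eps) * delta
```

Such an estimate could, in principle, be fed back as a pseudo-gradient to update an embedding model, which conveys why direct optimization of a discrete metric needs machinery beyond ordinary backpropagation.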