Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training

Zetong Li, Qinliang Su, Shijing Si, Jianxing Yu


Abstract
BERT and TFIDF features excel at capturing rich semantics and important words, respectively. Since most existing clustering methods rely solely on the BERT model, they often fall short in utilizing keyword information, which, however, is very useful for clustering short texts. In this paper, we propose a CO-Training Clustering (COTC) framework to harness the collective strengths of BERT and TFIDF features. Specifically, we develop two modules responsible for clustering BERT and TFIDF features, respectively. We use the deep representations and cluster assignments output by the TFIDF module to guide the learning of the BERT module, seeking to align the two at both the representation and cluster levels. Conversely, we also use the BERT module outputs to train the TFIDF module, leading to mutual promotion. We then show that this alternating co-training framework can be placed under a unified joint training objective, which allows the two modules to be connected tightly and the training signals to be propagated efficiently. Experiments on eight benchmark datasets show that our method significantly outperforms current SOTA methods.
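To make the two-level alignment idea concrete, the following is a minimal NumPy sketch of representation-level and cluster-level alignment losses that such a co-training framework might use. The function name, loss forms, and shapes are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def alignment_losses(h_bert, h_tfidf, logits_bert, logits_tfidf):
    """Illustrative alignment terms between two clustering modules.

    h_bert, h_tfidf: (n, d) deep representations from each module.
    logits_bert, logits_tfidf: (n, k) cluster-assignment logits.
    NOTE: these exact loss forms are assumptions, not the paper's.
    """
    # Representation-level alignment: mean squared distance between
    # the two modules' embeddings of the same texts.
    rep_loss = np.mean((h_bert - h_tfidf) ** 2)

    # Cluster-level alignment: symmetric KL divergence between the
    # soft cluster assignments produced by the two modules.
    p = softmax(logits_bert)
    q = softmax(logits_tfidf)
    eps = 1e-12
    kl_pq = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    kl_qp = np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=1)
    clu_loss = np.mean(kl_pq + kl_qp) / 2.0
    return rep_loss, clu_loss
```

In an alternating scheme, the TFIDF module's outputs would be held fixed while these terms update the BERT module, and vice versa; the paper's joint objective ties the two directions together.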
Anthology ID:
2024.emnlp-main.828
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14897–14913
URL:
https://preview.aclanthology.org/fix-sig-urls/2024.emnlp-main.828/
DOI:
10.18653/v1/2024.emnlp-main.828
Cite (ACL):
Zetong Li, Qinliang Su, Shijing Si, and Jianxing Yu. 2024. Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14897–14913, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training (Li et al., EMNLP 2024)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2024.emnlp-main.828.pdf
Software:
2024.emnlp-main.828.software.zip
Data:
2024.emnlp-main.828.data.zip