MIST: Mutual Information Maximization for Short Text Clustering
Krissanee Kamthawee, Can Udomcharoenchaikit, Sarana Nutanong
Abstract
Short text clustering poses substantial challenges due to the limited amount of information provided by each text sample. Previous efforts based on dense representations are still inadequate as texts are not sufficiently segregated in the embedding space before clustering. Even though the state-of-the-art method utilizes contrastive learning to boost performance, the process of summarizing all local tokens to form a sequence representation for the whole text includes noise that may obscure limited key information. We propose Mutual Information Maximization Framework for Short Text Clustering (MIST), which overcomes the information drown-out by including a mechanism to maximize the mutual information between representations on both sequence and token levels. Experimental results across eight standard short text datasets show that MIST outperforms the state-of-the-art method in terms of Accuracy or Normalized Mutual Information in most cases.
- Anthology ID:
- 2024.acl-long.610
- Volume:
- Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11309–11324
- URL:
- https://aclanthology.org/2024.acl-long.610
- Cite (ACL):
- Krissanee Kamthawee, Can Udomcharoenchaikit, and Sarana Nutanong. 2024. MIST: Mutual Information Maximization for Short Text Clustering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11309–11324, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- MIST: Mutual Information Maximization for Short Text Clustering (Kamthawee et al., ACL 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.acl-long.610.pdf