LLMs Enable Bag-of-Texts Representations for Short-Text Clustering

I-Fan Lin, Faegheh Hasibi, Suzan Verberne


Abstract
In this paper, we propose a training-free method for unsupervised short-text clustering that relies less on careful selection of embedders than other methods. In customer-facing chatbots, companies deal with large amounts of user utterances that need to be clustered according to their intent. In these settings, labeled data is typically unavailable, and the number of clusters is not known. Recent approaches to short-text clustering in label-free settings incorporate LLM output to refine existing embeddings. While LLMs can identify similar texts effectively, the resulting similarities may not be directly represented by distances in the dense vector space, as they depend on the original embedding. We therefore propose a method for transforming LLM judgments directly into a bag-of-texts representation in which texts are initialized to be equidistant, without assuming any prior distance relationships. Our method achieves comparable or superior results to state-of-the-art methods, but without embedding optimization or assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model-agnostic: it can be applied with any embedder, relatively small LLMs, and different clustering methods. We also show how our method scales to large datasets, reducing the computational cost of LLM use. The flexibility and scalability of our method make it more aligned with real-world training-free scenarios than existing clustering methods.
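To make the idea of an equidistant initialization concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that LLM judgments arrive as index pairs of texts judged similar; the function names `bag_of_texts` and `pairwise_distances`, the example utterances, and the binary update rule are all hypothetical choices made for illustration only.

```python
import numpy as np


def bag_of_texts(n_texts, similar_pairs):
    """Return an n x n matrix whose row i marks the texts judged similar to text i.

    Assumption (not from the paper): each judgment is a symmetric pair (i, j).
    """
    # Identity start: every text is its own one-hot vector, so every pair of
    # distinct texts is at the same Euclidean distance, sqrt(2).
    X = np.eye(n_texts)
    for i, j in similar_pairs:
        # Record the (hypothetical) LLM judgment in both rows.
        X[i, j] = 1.0
        X[j, i] = 1.0
    return X


def pairwise_distances(X):
    """Euclidean distance between every pair of rows, via broadcasting."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Illustrative utterances; the pairs below stand in for LLM output.
texts = [
    "I lost my card",
    "my card is missing",
    "what is my balance",
    "show my account balance",
]
judgments = [(0, 1), (2, 3)]

X = bag_of_texts(len(texts), judgments)
D = pairwise_distances(X)
```

In this sketch, texts that share the same set of similar texts end up with identical rows, while texts with no overlap move further apart than their initial distance. The resulting matrix `X` could then be passed to any standard clustering algorithm.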
Anthology ID:
2026.acl-long.291
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
6432–6447
URL:
https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.291/
Cite (ACL):
I-Fan Lin, Faegheh Hasibi, and Suzan Verberne. 2026. LLMs Enable Bag-of-Texts Representations for Short-Text Clustering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6432–6447, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
LLMs Enable Bag-of-Texts Representations for Short-Text Clustering (Lin et al., ACL 2026)
PDF:
https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.291.pdf
Checklist:
 2026.acl-long.291.checklist.pdf