Improving Clustering with Positive Pairs Generated from LLM-Driven Labels

Xiaotong Zhang, Ying Li


Abstract
Traditional unsupervised clustering methods, which often rely on contrastive training of embedders, lack label knowledge, which results in suboptimal performance. Furthermore, potential false negatives can destabilize the training process. Hence, we propose to improve clustering with Positive Pairs generated from LLM-driven Labels (PPLL). In the proposed framework, an LLM is first employed to cluster the data and generate corresponding mini-cluster labels. Positive pairs are then constructed from these labels, and an embedder is trained using BYOL, which obviates the need for negative pairs. After training, the acquired label knowledge is integrated into K-means clustering. This framework integrates label information throughout both training and inference while mitigating the reliance on negative pairs. Additionally, it generates interpretable labels that make the clustering results easier to understand. Empirical evaluations on a range of datasets demonstrate that our proposed framework consistently surpasses state-of-the-art baselines, achieving superior performance, robustness, and computational efficiency for diverse text clustering applications.
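The pair-construction and negative-free training steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_positive_pairs` and `byol_loss`, the example labels, and the choice to pair every two texts that share a mini-cluster label are all assumptions made for illustration. The loss shown is the standard BYOL regression objective (2 minus twice the cosine similarity between an online prediction and a target projection); the online and target networks themselves are omitted.

```python
from collections import defaultdict
from itertools import combinations
import math


def build_positive_pairs(labels):
    """Return index pairs of items that share the same mini-cluster label.

    Assumption: every two items with an identical label form a positive pair.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    pairs = []
    for members in groups.values():
        # All unordered pairs inside one mini-cluster.
        pairs.extend(combinations(members, 2))
    return pairs


def byol_loss(prediction, target):
    """BYOL-style loss for one pair: 2 - 2 * cosine(prediction, target).

    It uses only the positive pair, so no negative samples are required.
    """
    dot = sum(p * t for p, t in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_t)


# Hypothetical mini-cluster labels, standing in for LLM output.
mini_cluster_labels = ["billing", "shipping", "billing", "shipping", "billing"]
pairs = build_positive_pairs(mini_cluster_labels)
print(pairs)
```

In this toy run, texts 0, 2, and 4 share one label and texts 1 and 3 share another, so four positive pairs are produced. The loss is 0 when the two vectors of a pair point in the same direction and grows as they diverge.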
Anthology ID:
2025.emnlp-main.613
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12213–12229
URL:
https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.613/
DOI:
10.18653/v1/2025.emnlp-main.613
Cite (ACL):
Xiaotong Zhang and Ying Li. 2025. Improving Clustering with Positive Pairs Generated from LLM-Driven Labels. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12213–12229, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Improving Clustering with Positive Pairs Generated from LLM-Driven Labels (Zhang & Li, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.613.pdf
Checklist:
 2025.emnlp-main.613.checklist.pdf