Kerria Pang-Naylor

2025

Given the inherent subjectivity of similarity in text, fully unsupervised text clustering is unlikely to produce groupings that work across a variety of use cases. Traditional techniques to guide clustering rely on costly, time-consuming human feedback and/or pre-existing labels. Leveraging recent advancements in LLMs and decoder-only embedding models, we present techniques to effectively control text embeddings with minimal human input: prefix instructions and LLM preprocessing. We evaluate clustering performance for datasets with multiple independent ground-truth labels, or perspectives, and find that these techniques can be used to improve clustering for one perspective or use case, at the cost of a tradeoff in performance for another use case.

Co-authors

Khyati Mahajan 1

Shivani Manivasagan 1

Nicholas Mondello 1

Aitong Zhong 1

Venues

emnlp1

Fix author