Abstract
In this paper we propose a new framework and new methods for the reference-free evaluation of topic segmentation systems directly in the embedding space. Specifically, we define a common framework for reference-free, embedding-based topic segmentation metrics and show how it applies to an existing metric. We then define new metrics based on a previously defined cohesion score, Average Relative Proximity. Using this approach, we show that Large Language Models (LLMs) yield features that, if used correctly, can correlate strongly with traditional topic segmentation metrics based on costly and rare human annotations, while outperforming existing reference-free metrics borrowed from clustering evaluation in most domains. We then show that smaller language models specifically fine-tuned for different sentence-level tasks can outperform LLMs several orders of magnitude larger. Through a thorough comparison of our metrics’ performance across different datasets, we find that conversational data present the biggest challenge in this framework. Finally, we analyse the behaviour of our metrics in specific error cases, such as under-generation and the shifting of ground-truth topic boundaries, and show that our metrics behave more consistently than other reference-free methods.