Yoram Louzoun

2025

Knowledge discovery from large-scale, heterogeneous textual corpora presents a significant challenge. Document clustering offers a practical solution by organizing unstructured texts into coherent groups based on content and thematic similarity. However, clustering does not inherently ensure thematic consistency. Here, we propose a novel framework that constructs a similarity graph over document embeddings and applies iterative graph-based clustering algorithms to partition the corpus into initial clusters. To overcome the limitations of conventional methods in producing semantically consistent clusters, we incorporate iterative feedback from a large language model (LLM) to guide the refinement process. The LLM is used to assess cluster quality and adjust edge weights within the graph, promoting better intra-cluster cohesion and inter-cluster separation. The LLM guidance is based on a set of success Rate metrics that we developed to measure the semantic coherence of clusters. Experimental results on multiple benchmark datasets demonstrate that the iterative process and additional user-supplied a priori edges improve the summaries’ consistency and fluency, highlighting the importance of known connections among the documents. The removal of very rare or very frequent sentences has a mixed effect on the quality scores.Our full code is available here: https://github.com/D2CS-sub/D2CS

2024

pdf bib abs
Data-driven Coreference-based Ontology Building
Shir Ashury Tahan | Amir David Nissan Cohen | Nadav Cohen | Yoram Louzoun | Yoav Goldberg
Findings of the Association for Computational Linguistics: EMNLP 2024

While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what can we learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweeness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlaps significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a creative-commons license.

Co-authors

Dekel Cohen 1

Yoav Goldberg 1

Etzion Harari 1

Regev Yehezkel Imra 1

Venues

findings2

Fix author