Zakir Durumeric


2025

pdf bib
Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings
Hans William Alexander Hanley | Zakir Durumeric
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson 𝜌 = 0.816). Once trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.

2023

pdf bib
TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings
Hans Hanley | Zakir Durumeric
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage’s stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variety of different topics to train topic-agnostic/TAG and topic-aware/TAW embeddings for use in downstream stance detection. Combining these embeddings in our full TATA model, we achieve state-of-the-art performance across several public stance detection datasets (0.771 F1-score on the Zero-shot VAST dataset). We release our code and data at https://github.com/hanshanley/tata.