Multilingual Clustering of Streaming News

Sebastião Miranda, Artūrs Znotiņš, Shay B. Cohen, Guntis Barzdins


Abstract
Clustering news across languages enables efficient media monitoring by aggregating articles from multilingual sources into coherent stories. Doing so in an online setting allows scalable processing of massive news streams. To this end, we describe a novel method for clustering an incoming stream of multilingual documents into monolingual and crosslingual clusters. Unlike typical clustering approaches that report results on datasets with a small and known number of labels, we tackle the problem of discovering an ever-growing number of cluster labels in an online fashion, using real news datasets in multiple languages. In our formulation, the monolingual clusters group together documents, while the crosslingual clusters group together monolingual clusters, one per language that appears in the stream. Our method is simple to implement, computationally efficient, and produces state-of-the-art results on datasets in German, English, and Spanish.
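
The two-level formulation in the abstract (documents grouped into monolingual clusters, which are in turn grouped into crosslingual clusters holding at most one monolingual cluster per language) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the cosine similarity against a running sum of vectors, the fixed thresholds, and the assumption that every document already arrives as a dense vector in a shared crosslingual space are all hypothetical choices made only for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class MonoCluster:
    """Hypothetical monolingual cluster: documents in a single language."""
    lang: str
    total: list                      # running sum of member vectors
    docs: list = field(default_factory=list)

    def add(self, doc_id, vec):
        self.total = [t + v for t, v in zip(self.total, vec)]
        self.docs.append(doc_id)


@dataclass
class CrossCluster:
    """Hypothetical crosslingual cluster: one MonoCluster per language."""
    by_lang: dict = field(default_factory=dict)


class StreamClusterer:
    # The thresholds are illustrative assumptions, not values from the paper.
    def __init__(self, mono_threshold=0.8, cross_threshold=0.8):
        self.mono_threshold = mono_threshold
        self.cross_threshold = cross_threshold
        self.mono = []
        self.cross = []

    def add(self, doc_id, lang, vec):
        """Assign one incoming document; open a new cluster if none is close."""
        best, best_sim = None, self.mono_threshold
        for cluster in self.mono:
            if cluster.lang != lang:
                continue
            sim = cosine(cluster.total, vec)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.add(doc_id, vec)
            return best
        new = MonoCluster(lang, list(vec), [doc_id])
        self.mono.append(new)
        self._link(new)
        return new

    def _link(self, mono):
        """Attach a new monolingual cluster to a crosslingual cluster."""
        best, best_sim = None, self.cross_threshold
        for cc in self.cross:
            if mono.lang in cc.by_lang:
                continue             # keep at most one cluster per language
            sim = max(cosine(m.total, mono.total) for m in cc.by_lang.values())
            if sim >= best_sim:
                best, best_sim = cc, sim
        if best is None:
            best = CrossCluster()
            self.cross.append(best)
        best.by_lang[mono.lang] = mono
```

Each document is processed once, in arrival order, and the number of clusters grows as new stories appear, which mirrors the online setting described above. A short usage example with made-up two-dimensional vectors:

```python
sc = StreamClusterer()
sc.add("en1", "en", [1.0, 0.0])
sc.add("en2", "en", [0.9, 0.1])   # joins the cluster of en1
sc.add("de1", "de", [1.0, 0.05])  # new German cluster, linked to the English one
sc.add("es1", "es", [0.0, 1.0])   # unrelated story, opens a new crosslingual cluster
```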
Anthology ID:
D18-1483
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4535–4544
URL:
https://aclanthology.org/D18-1483
DOI:
10.18653/v1/D18-1483
Cite (ACL):
Sebastião Miranda, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. Multilingual Clustering of Streaming News. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4535–4544, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Multilingual Clustering of Streaming News (Miranda et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/D18-1483.pdf
Code:
priberam/news-clustering