Multilingual Clustering of Streaming News
Sebastião Miranda, Artūrs Znotiņš, Shay B. Cohen, Guntis Barzdins
Abstract
Clustering news across languages enables efficient media monitoring by aggregating articles from multilingual sources into coherent stories. Doing so in an online setting allows scalable processing of massive news streams. To this end, we describe a novel method for clustering an incoming stream of multilingual documents into monolingual and crosslingual clusters. Unlike typical clustering approaches that report results on datasets with a small and known number of labels, we tackle the problem of discovering an ever growing number of cluster labels in an online fashion, using real news datasets in multiple languages. In our formulation, the monolingual clusters group together documents while the crosslingual clusters group together monolingual clusters, one per language that appears in the stream. Our method is simple to implement, computationally efficient and produces state-of-the-art results on datasets in German, English and Spanish.
- Anthology ID:
- D18-1483
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month:
- October-November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4535–4544
- URL:
- https://aclanthology.org/D18-1483
- DOI:
- 10.18653/v1/D18-1483
- Cite (ACL):
- Sebastião Miranda, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. Multilingual Clustering of Streaming News. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4535–4544, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Multilingual Clustering of Streaming News (Miranda et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/D18-1483.pdf
- Code:
- priberam/news-clustering
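
The two-level formulation in the abstract (documents grouped into monolingual clusters, and monolingual clusters grouped into crosslingual clusters with at most one per language) can be illustrated with a minimal sketch. This is not the authors' implementation and not the code in priberam/news-clustering. The class name `StreamClusterer`, the thresholds, the use of plain cosine similarity, and the premise that documents in every language arrive as sparse vectors over one shared feature space are all assumptions made for illustration. The sketch also links a monolingual cluster to a crosslingual cluster only once, at the moment it is created, which is a simplification.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class StreamClusterer:
    """Hypothetical online clusterer; thresholds are illustrative only."""

    def __init__(self, mono_threshold=0.5, cross_threshold=0.5):
        self.mono_threshold = mono_threshold
        self.cross_threshold = cross_threshold
        self.mono = []   # monolingual clusters: lang, summed vector, doc ids
        self.cross = []  # crosslingual clusters: {lang: monolingual index}

    def add(self, doc_id, lang, vec):
        """Assign one incoming document; returns (mono_id, cross_id)."""
        best, best_sim = None, self.mono_threshold
        for i, c in enumerate(self.mono):
            if c["lang"] != lang:
                continue  # monolingual clusters hold a single language
            # Cosine is scale invariant, so the summed vector acts as a centroid.
            sim = cosine(vec, c["sum"])
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No cluster is close enough: the number of clusters grows.
            self.mono.append({"lang": lang, "sum": dict(vec), "docs": [doc_id]})
            best = len(self.mono) - 1
            return best, self._link(best)
        c = self.mono[best]
        for k, w in vec.items():
            c["sum"][k] = c["sum"].get(k, 0.0) + w
        c["docs"].append(doc_id)
        return best, c["cross"]

    def _link(self, m):
        """Attach a new monolingual cluster to a crosslingual cluster."""
        new = self.mono[m]
        best, best_sim = None, self.cross_threshold
        for j, members in enumerate(self.cross):
            if new["lang"] in members:
                continue  # at most one monolingual cluster per language
            agg = defaultdict(float)
            for idx in members.values():
                for k, w in self.mono[idx]["sum"].items():
                    agg[k] += w
            sim = cosine(new["sum"], agg)
            if sim >= best_sim:
                best, best_sim = j, sim
        if best is None:
            self.cross.append({})
            best = len(self.cross) - 1
        self.cross[best][new["lang"]] = m
        new["cross"] = best
        return best


# Toy stream: (doc_id, language, vector in the assumed shared feature space).
stream = [
    ("d1", "en", {"quake": 1.0, "chile": 1.0}),
    ("d2", "en", {"quake": 1.0, "chile": 1.0, "tsunami": 1.0}),
    ("d3", "es", {"quake": 1.0, "chile": 1.0}),
    ("d4", "en", {"election": 1.0, "vote": 1.0}),
]
clusterer = StreamClusterer()
results = [clusterer.add(*doc) for doc in stream]
print(results)
```

In this toy stream the two English earthquake articles share a monolingual cluster, the Spanish article opens its own monolingual cluster that joins the same crosslingual cluster, and the unrelated English article starts a new cluster at both levels.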