Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up

Jakub Piskorski, Nicolas Stefanovitch, Guillaume Jacquet, Aldo Podavini


Abstract
This paper presents a study of state-of-the-art unsupervised and linguistically unsophisticated keyword extraction algorithms based on statistical, graph-based, and embedding-based approaches, including, among others, Total Keyword Frequency, TF-IDF, RAKE, KPMiner, YAKE, KeyBERT, and variants of TextRank-based keyword extraction algorithms. The study was motivated by the need to select the most appropriate technique to extract keywords for indexing news articles in a real-world large-scale news analysis engine. The algorithms were evaluated on a corpus of circa 330 news articles in 7 languages. The best overall F1 scores, averaged over all languages, were obtained using a combination of the recently introduced YAKE algorithm and KPMiner (20.1%, 46.6%, and 47.2% for exact, partial, and fuzzy matching, respectively).
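To make the reported metric concrete, the following is a minimal sketch of how an exact-match F1 score for one article could be computed. It is an illustration only, not the paper's evaluation code: the function name `exact_match_f1`, the lowercasing and whitespace normalization, and the sample keywords are all assumptions, and the partial and fuzzy matching variants are not reproduced here because the abstract does not define them.

```python
def exact_match_f1(predicted, gold):
    """Return (precision, recall, F1) for exact keyword matches.

    Assumption: keywords are compared case-insensitively after trimming
    whitespace; the paper may normalize differently.
    """
    pred = {k.strip().lower() for k in predicted}
    ref = {k.strip().lower() for k in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0

    hits = len(pred & ref)  # keywords present in both sets
    if hits == 0:
        return 0.0, 0.0, 0.0

    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical extractor output and gold annotation for one article.
predicted = ["climate summit", "Paris", "emissions", "budget"]
gold = ["climate summit", "emissions", "carbon tax"]

precision, recall, f1 = exact_match_f1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

A corpus-level figure such as the ones quoted above would then typically be an average of per-article scores, although the averaging scheme is likewise an assumption here.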
Anthology ID:
2021.hackashop-1.6
Volume:
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
Month:
April
Year:
2021
Address:
Online
Venue:
Hackashop
Publisher:
Association for Computational Linguistics
Pages:
35–44
URL:
https://aclanthology.org/2021.hackashop-1.6
Cite (ACL):
Jakub Piskorski, Nicolas Stefanovitch, Guillaume Jacquet, and Aldo Podavini. 2021. Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, pages 35–44, Online. Association for Computational Linguistics.
Cite (Informal):
Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up (Piskorski et al., Hackashop 2021)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2021.hackashop-1.6.pdf