Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up
Jakub Piskorski, Nicolas Stefanovitch, Guillaume Jacquet, Aldo Podavini
Abstract
This paper presents a study of state-of-the-art unsupervised and linguistically unsophisticated keyword extraction algorithms, based on statistic-, graph-, and embedding-based approaches, including, i.a., Total Keyword Frequency, TF-IDF, RAKE, KPMiner, YAKE, KeyBERT, and variants of TextRank-based keyword extraction algorithms. The study was motivated by the need to select the most appropriate technique to extract keywords for indexing news articles in a real-world large-scale news analysis engine. The algorithms were evaluated on a corpus of circa 330 news articles in 7 languages. The overall best F1 scores for all languages on average were obtained using a combination of the recently introduced YAKE algorithm and KPMiner (20.1%, 46.6% and 47.2% for exact, partial and fuzzy matching resp.).
- Anthology ID:
- 2021.hackashop-1.6
- Volume:
- Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
- Month:
- April
- Year:
- 2021
- Address:
- Online
- Venue:
- Hackashop
- Publisher:
- Association for Computational Linguistics
- Pages:
- 35–44
- URL:
- https://aclanthology.org/2021.hackashop-1.6
- Cite (ACL):
- Jakub Piskorski, Nicolas Stefanovitch, Guillaume Jacquet, and Aldo Podavini. 2021. Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, pages 35–44, Online. Association for Computational Linguistics.
- Cite (Informal):
- Exploring Linguistically-Lightweight Keyword Extraction Techniques for Indexing News Articles in a Multilingual Set-up (Piskorski et al., Hackashop 2021)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2021.hackashop-1.6.pdf