A Summarization Dataset of Slovak News Articles

Marek Suppa, Jergus Adamec


Abstract
As a well established NLP task, single-document summarization has seen significant interest in the past few years. However, most of the work has been done on English datasets. This is particularly noticeable in the context of evaluation where the dominant ROUGE metric assumes its input to be written in English. In this paper we aim to address both of these issues by introducing a summarization dataset of articles from a popular Slovak news site and proposing small adaptation to the ROUGE metric that make it better suited for Slovak texts. Several baselines are evaluated on the dataset, including an extractive approach based on the Multilingual version of the BERT architecture. To the best of our knowledge, the presented dataset is the first large-scale news-based summarization dataset for text written in Slovak language. It can be reproduced using the utilities available at https://github.com/NaiveNeuron/sme-sum
Anthology ID:
2020.lrec-1.830
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
6725–6730
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.830
DOI:
Bibkey:
Cite (ACL):
Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.
Cite (Informal):
A Summarization Dataset of Slovak News Articles (Suppa & Adamec, LREC 2020)
Copy Citation:
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2020.lrec-1.830.pdf