SlovakSum: A Large Scale Slovak Summarization Dataset

Viktoria Ondrejova, Marek Suppa


Abstract
The ability to automatically summarize news articles has become increasingly important due to the vast amount of information available online. Together with the rise of chatbots, Natural Language Processing (NLP) has recently experienced a tremendous amount of development. Despite these advancements, the majority of research focuses on established, well-resourced languages such as English. To contribute to the development of NLP for Slovak, a low-resource language, we introduce SlovakSum, a Slovak news summarization dataset consisting of over 200 thousand news articles with titles and short abstracts obtained from multiple Slovak newspapers. We evaluate several abstractive baselines, including mBART and mT5 models. The code for reproducing our dataset and experiments can be found at https://github.com/NaiveNeuron/slovaksum
Anthology ID:
2024.lrec-main.1298
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
14916–14922
URL:
https://aclanthology.org/2024.lrec-main.1298
Cite (ACL):
Viktoria Ondrejova and Marek Suppa. 2024. SlovakSum: A Large Scale Slovak Summarization Dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14916–14922, Torino, Italia. ELRA and ICCL.
Cite (Informal):
SlovakSum: A Large Scale Slovak Summarization Dataset (Ondrejova & Suppa, LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.lrec-main.1298.pdf