Abstract
The ability to automatically summarize news articles has become increasingly important due to the vast amount of information available online. Together with the rise of chatbots, Natural Language Processing (NLP) has recently experienced a tremendous amount of development. Despite these advancements, the majority of research is focused on established, well-resourced languages such as English. To contribute to the development of the low-resource Slovak language, we introduce SlovakSum, a Slovak news summarization dataset consisting of over 200 thousand news articles with titles and short abstracts obtained from multiple Slovak newspapers. Various baselines were evaluated using the abstractive approach, including MBART and mT5 models. The code for the reproduction of our dataset and experiments can be found at https://github.com/NaiveNeuron/slovaksum
- Anthology ID:
- 2024.lrec-main.1298
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 14916–14922
- URL:
- https://aclanthology.org/2024.lrec-main.1298
- Cite (ACL):
- Viktoria Ondrejova and Marek Suppa. 2024. SlovakSum: A Large Scale Slovak Summarization Dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14916–14922, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- SlovakSum: A Large Scale Slovak Summarization Dataset (Ondrejova & Suppa, LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/add_acl24_videos/2024.lrec-main.1298.pdf