The Polish Summaries Corpus

Maciej Ogrodniczuk, Mateusz Kopeć


Abstract
This article presents the Polish Summaries Corpus, a new resource created to support the development and evaluation of tools for automated single-document summarization of Polish. The Corpus contains a large number of manual summaries of news articles, with many independently created summaries for a single text. Such an approach is intended to overcome annotator bias, which is often described as a problem when summarization algorithms are evaluated against a single gold standard. There are several summarizers developed specifically for the Polish language, but their in-depth evaluation and comparison were impossible without a large, manually created corpus. We present in detail the text selection process, the annotation process and the contents of the corpus, which includes both abstract free-word summaries and extraction-based summaries created by selecting text spans from the original document. Finally, we describe how this resource could be used not only for the evaluation of existing summarization tools, but also for studies on the human summarization process in the Polish language.
Anthology ID:
L14-1145
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3712–3715
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1211_Paper.pdf
Cite (ACL):
Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3712–3715, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
The Polish Summaries Corpus (Ogrodniczuk & Kopeć, LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1211_Paper.pdf