The CONCISUS Corpus of Event Summaries

Horacio Saggion, Sandra Szasz


Abstract
Text summarization and information extraction systems require adaptation to new domains and languages. This adaptation usually depends on the availability of language resources such as corpora. In this paper we present a comparable corpus in Spanish and English for the study of cross-lingual information extraction and summarization: the CONCISUS Corpus. It is a rich human-annotated dataset composed of comparable event summaries in Spanish and English covering four different domains: aviation accidents, rail accidents, earthquakes, and terrorist attacks. In addition to the monolingual summaries in English and Spanish, we provide automatic translations and "comparable" full reports of the events. The human annotations are concepts marked in the textual sources that represent the key information associated with each event type. The dataset has also been annotated using text processing pipelines. It is being made freely available to the research community for research purposes.
Anthology ID:
L12-1372
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2031–2037
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/642_Paper.pdf
Cite (ACL):
Horacio Saggion and Sandra Szasz. 2012. The CONCISUS Corpus of Event Summaries. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2031–2037, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
The CONCISUS Corpus of Event Summaries (Saggion & Szasz, LREC 2012)