Dataset Reproducibility and IR Methods in Timeline Summarization

Leo Born; Maximilian Bacher; Katja Markert

Abstract

Timeline summarization (TLS) generates a dated overview of real-world events based on event-specific corpora. The two standard datasets for this task were collected using Google searches for news reports on given events. Not only is this IR method not reproducible at different search times, it also uses components (such as document popularity) that are not always available for any large news corpus. It is unclear how TLS algorithms fare when provided with event corpora collected with varying IR methods. We therefore construct event-specific corpora from a large static background corpus, the newsroom dataset, using differing, relatively simple IR methods based on raw text alone. We show that the choice of IR method plays a crucial role in the performance of various TLS algorithms. A weak TLS algorithm can even match a stronger one by employing a stronger IR method in the data collection phase. Furthermore, the results of TLS systems are often highly sensitive to additional sentence filtering. We consequently advocate for integrating IR into the development of TLS systems and having a common static background corpus for evaluation of TLS systems.
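As a rough illustration of what building an event-specific corpus from a static background corpus with a simple, raw-text IR method could look like, the sketch below filters articles by date range and ranks them by keyword matches. It is a minimal sketch under stated assumptions, not the retrieval procedure used in the paper: the `Article` schema, the `keyword_score` and `build_event_corpus` helpers, the `min_score` threshold, and the toy articles are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """One document in a static background corpus (hypothetical schema)."""
    published: date
    text: str


def keyword_score(text: str, keywords: list[str]) -> int:
    """Count the tokens in the text that match any query keyword."""
    wanted = {k.lower() for k in keywords}
    tokens = text.lower().split()
    # Strip surrounding punctuation so "protests." still matches "protests".
    return sum(1 for tok in tokens if tok.strip(".,;:!?\"'()") in wanted)


def build_event_corpus(corpus, keywords, start, end, min_score=1):
    """Keep in-range articles scoring at least min_score, best match first."""
    scored = [
        (keyword_score(a.text, keywords), a)
        for a in corpus
        if start <= a.published <= end
    ]
    hits = [(s, a) for s, a in scored if s >= min_score]
    # Higher score first; ties are broken by earlier publication date.
    hits.sort(key=lambda pair: (-pair[0], pair[1].published))
    return [a for _, a in hits]


# Toy background corpus, invented purely for illustration.
background = [
    Article(date(2011, 2, 1), "Protests in Egypt grow as Egypt unrest continues."),
    Article(date(2011, 2, 3), "Stock markets rally."),
    Article(date(2011, 2, 5), "Egypt sees new protests."),
    Article(date(2012, 1, 1), "Egypt one year later: protests remembered."),
]

event_corpus = build_event_corpus(
    background,
    keywords=["egypt", "protests"],
    start=date(2011, 1, 1),
    end=date(2011, 12, 31),
)
```

Because the background corpus is fixed and the scoring uses only raw text, rerunning the same query returns the same event corpus, which is the reproducibility property the abstract argues for. Swapping `keyword_score` for a different scoring function would correspond to varying the IR method.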

Anthology ID: 2020.lrec-1.218
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1763–1771
Language: English
URL: https://preview.aclanthology.org/ingest-emnlp/2020.lrec-1.218/
Cite (ACL): Leo Born, Maximilian Bacher, and Katja Markert. 2020. Dataset Reproducibility and IR Methods in Timeline Summarization. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1763–1771, Marseille, France. European Language Resources Association.
Cite (Informal): Dataset Reproducibility and IR Methods in Timeline Summarization (Born et al., LREC 2020)
PDF: https://preview.aclanthology.org/ingest-emnlp/2020.lrec-1.218.pdf
