A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, Georgiana Ifrim


Abstract
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
Anthology ID:
2020.acl-main.120
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1302–1308
Language:
URL:
https://aclanthology.org/2020.acl-main.120
DOI:
10.18653/v1/2020.acl-main.120
Bibkey:
Cite (ACL):
Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, and Georgiana Ifrim. 2020. A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1302–1308, Online. Association for Computational Linguistics.
Cite (Informal):
A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal (Gholipour Ghalandari et al., ACL 2020)
Copy Citation:
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.acl-main.120.pdf
Video:
 http://slideslive.com/38929109
Code
 complementizer/wcep-mds-dataset
Data
WCEP