CCSum: A Large-Scale and High-Quality Dataset for Abstractive News Summarization

Xiang Jiang, Markus Dreyer


Abstract
Training a supervised news summarization model requires large amounts of high-quality training data consisting of news articles paired with reference summaries. However, obtaining such data is costly, and existing datasets contain a considerable amount of noise. We present CCSum, a new large-scale and high-quality dataset for supervised abstractive news summarization containing 1.3 million training samples. In creating this dataset, we take advantage of the journalistic inverted-pyramid style of news writing: in some articles, the first sentence can be considered a summary of the reported story. Accordingly, among 35 million CommonCrawl News articles, we identify pairs of articles about the same news story and use one article’s first sentence as the summary for the other article. To ensure high quality, we apply strict filters whose parameters we optimize using Bayesian optimization. According to our human evaluation, the resulting dataset is more factual and informative than established summarization datasets: fewer than 1% of its summaries have major factual inconsistencies with the corresponding news articles, compared to 5.5% to 15.4% in existing datasets. Summarization models trained on our dataset are also favored over those trained on CNN/Daily Mail. The proposed dataset can open new opportunities for future research in abstractive summarization.
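To make the pairing idea in the abstract concrete, the sketch below shows one hypothetical way to turn a cluster of same-story articles into (article, summary) pairs by taking one article’s lead sentence as the summary for another and keeping only sufficiently similar pairs. The clustering, embedding representation, similarity measure, and threshold are illustrative assumptions, not the authors’ actual pipeline or filter set.

```python
# Minimal sketch of lead-sentence pairing within a same-story cluster.
# Assumes articles are already clustered by story and carry precomputed
# embeddings; these are hypothetical inputs, not the paper's pipeline.
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class Article:
    url: str
    sentences: list[str]     # article body split into sentences
    embedding: list[float]   # precomputed document embedding


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def make_pairs(cluster: Iterable[Article],
               min_sim: float = 0.75) -> Iterator[Tuple[str, str]]:
    """Yield (article_text, summary) pairs from one same-story cluster."""
    articles = list(cluster)
    for src in articles:                 # donor of the lead sentence
        summary = src.sentences[0]       # inverted-pyramid lead as summary
        for tgt in articles:             # article to be summarized
            if tgt.url == src.url:
                continue
            # simple similarity filter; the paper instead tunes strict
            # filter parameters with Bayesian optimization
            if cosine(src.embedding, tgt.embedding) >= min_sim:
                yield " ".join(tgt.sentences), summary
```

Iterating make_pairs over each story cluster would produce candidate training pairs, which in the paper are then subjected to much stricter, tuned filtering.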
Anthology ID:
2024.naacl-long.406
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
7306–7336
URL:
https://aclanthology.org/2024.naacl-long.406
DOI:
10.18653/v1/2024.naacl-long.406
Cite (ACL):
Xiang Jiang and Markus Dreyer. 2024. CCSum: A Large-Scale and High-Quality Dataset for Abstractive News Summarization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7306–7336, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
CCSum: A Large-Scale and High-Quality Dataset for Abstractive News Summarization (Jiang & Dreyer, NAACL 2024)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2024.naacl-long.406.pdf