TL;DR: Mining Reddit to Learn Automatic Summarization

Michael Völske, Martin Potthast, Shahbaz Syed, Benno Stein


Abstract
Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a “TL;DR” to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.
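The core idea in the abstract, treating the text after a "TL;DR" marker as an author-written summary of the text before it, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the marker pattern, the function name `split_tldr`, the choice of the last marker in a post, and the word-count thresholds are all assumptions made here for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical marker pattern (an assumption, not taken from the paper):
# matches "TL;DR", "tl;dr", "TLDR", "tl dr", optionally followed by ":" or "-".
TLDR_MARKER = re.compile(r"\btl\s*;?\s*dr\b\s*[:\-]?\s*", re.IGNORECASE)


def split_tldr(
    post: str,
    min_content_words: int = 5,   # illustrative threshold
    min_summary_words: int = 1,   # illustrative threshold
) -> Optional[Tuple[str, str]]:
    """Split a post into (content, summary) at its last TL;DR marker.

    Returns None when no marker is found or when either part is too short
    to be a plausible content/summary pair.
    """
    matches = list(TLDR_MARKER.finditer(post))
    if not matches:
        return None

    # Assumption: the summary is appended, so the last marker is used.
    last = matches[-1]
    content = post[: last.start()].strip()
    summary = post[last.end():].strip()

    if len(content.split()) < min_content_words:
        return None
    if len(summary.split()) < min_summary_words:
        return None
    return content, summary


if __name__ == "__main__":
    example = (
        "I spent three weeks debugging a flaky test in our build. "
        "TL;DR: the clock was wrong."
    )
    print(split_tldr(example))
```

A real mining setup would likely need further filtering (for example, posts where the marker appears at the start, or where the summary is longer than the content); the sketch leaves those cases out.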
Anthology ID:
W17-4508
Volume:
Proceedings of the Workshop on New Frontiers in Summarization
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, Fei Liu
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
59–63
URL:
https://aclanthology.org/W17-4508
DOI:
10.18653/v1/W17-4508
Cite (ACL):
Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to Learn Automatic Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
TL;DR: Mining Reddit to Learn Automatic Summarization (Völske et al., 2017)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/W17-4508.pdf
Data
Webis-TLDR-17 Corpus