TL;DR: Mining Reddit to Learn Automatic Summarization

Michael Völske; Martin Potthast; Shahbaz Syed; Benno Stein

doi:10.18653/v1/W17-4508

Abstract

Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a “TL;DR” to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.
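The core idea in the abstract, treating the text after an author's "TL;DR" as a summary of the text before it, can be illustrated with a minimal sketch. This is not the authors' pipeline: the marker pattern `TLDR_MARKER`, the function name `split_tldr`, and the choice to split at the last marker are all assumptions made for illustration, and a real corpus would likely need further filtering.

```python
import re
from typing import Optional, Tuple

# Hypothetical marker pattern (an assumption, not taken from the paper):
# matches "TL;DR", "tl dr", "tldr", optionally followed by ":" or "-".
TLDR_MARKER = re.compile(r"\btl\s*;?\s*dr\b\s*[:\-]?\s*", re.IGNORECASE)


def split_tldr(post: str) -> Optional[Tuple[str, str]]:
    """Split a post into (content, summary) at its last TL;DR marker.

    Returns None when there is no marker, or when either side of the
    marker is empty, since such posts yield no usable training pair.
    """
    matches = list(TLDR_MARKER.finditer(post))
    if not matches:
        return None
    last = matches[-1]
    content = post[:last.start()].strip()
    summary = post[last.end():].strip()
    if not content or not summary:
        return None
    return content, summary


if __name__ == "__main__":
    example = (
        "I spent three weeks debugging a cache issue. "
        "TL;DR: clear the cache first."
    )
    print(split_tldr(example))
```

Splitting at the last marker is one possible design choice for posts that append the summary at the end; posts that lead with the summary would need a different rule.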

Anthology ID: W17-4508
Volume: Proceedings of the Workshop on New Frontiers in Summarization
Month: September
Year: 2017
Address: Copenhagen, Denmark
Editors: Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, Fei Liu
Venue: WS
Publisher: Association for Computational Linguistics
Pages: 59–63
URL: https://preview.aclanthology.org/fix-sig-urls/W17-4508/
DOI: 10.18653/v1/W17-4508
Cite (ACL): Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to Learn Automatic Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal): TL;DR: Mining Reddit to Learn Automatic Summarization (Völske et al., 2017)
PDF: https://preview.aclanthology.org/fix-sig-urls/W17-4508.pdf
Data: Webis-TLDR-17 Corpus
