Abstract
Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a “TL;DR” to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.
- Anthology ID:
- W17-4508
- Volume:
- Proceedings of the Workshop on New Frontiers in Summarization
- Month:
- September
- Year:
- 2017
- Address:
- Copenhagen, Denmark
- Editors:
- Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, Fei Liu
- Venue:
- WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 59–63
- URL:
- https://aclanthology.org/W17-4508
- DOI:
- 10.18653/v1/W17-4508
- Cite (ACL):
- Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to Learn Automatic Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark. Association for Computational Linguistics.
- Cite (Informal):
- TL;DR: Mining Reddit to Learn Automatic Summarization (Völske et al., 2017)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/W17-4508.pdf
- Data
- Webis-TLDR-17 Corpus
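
The mining idea described in the abstract, splitting a post at its author-written “TL;DR” to obtain a (content, summary) pair, can be illustrated with a minimal sketch. This is not the authors' pipeline: the marker pattern, the function name `split_on_tldr`, and the filtering rules (exactly one marker, content longer than summary) are assumptions made only for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed marker pattern (hypothetical): "tl;dr", "tldr" or "tl dr" in any
# casing, optionally followed by a colon or hyphen.
TLDR_PATTERN = re.compile(r"\btl\s*;?\s*dr\b\s*[:\-]?\s*", re.IGNORECASE)


def split_on_tldr(post: str) -> Optional[Tuple[str, str]]:
    """Split a post into (content, summary) at its TL;DR marker.

    Returns None when the post has no marker, has more than one marker
    (treated as ambiguous here), or when the summary is not shorter than
    the content. These filters are illustrative assumptions.
    """
    matches = list(TLDR_PATTERN.finditer(post))
    if len(matches) != 1:
        return None

    marker = matches[0]
    content = post[:marker.start()].strip()
    summary = post[marker.end():].strip()

    # Both parts must be non-empty, and a summary should be shorter
    # than the text it summarizes.
    if not content or not summary:
        return None
    if len(summary.split()) >= len(content.split()):
        return None
    return content, summary


if __name__ == "__main__":
    example = (
        "I spent three weeks rebuilding my bike from spare parts. "
        "TL;DR: rebuilt my bike."
    )
    print(split_on_tldr(example))
```

Applied over a crawl of posts, a function like this would keep only the posts for which it returns a pair; any real pipeline would likely need further cleaning (for example, removing bot posts or quoted text), which this sketch does not attempt.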