A Corpus of German Reddit Exchanges (GeRedE)
Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, Thomas Proisl
Abstract
GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.
- Anthology ID: 2020.lrec-1.774
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 6310–6316
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.774
- Cite (ACL): Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl. 2020. A Corpus of German Reddit Exchanges (GeRedE). In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6310–6316, Marseille, France. European Language Resources Association.
- Cite (Informal): A Corpus of German Reddit Exchanges (GeRedE) (Blombach et al., LREC 2020)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.774.pdf