A Corpus of German Reddit Exchanges (GeRedE)

Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, Thomas Proisl


Abstract
GeRedE is a 270-million-token German corpus of computer-mediated communication (CMC) containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. The paper describes how we extracted the German data from a large, freely available data set, which further pre-processing steps we applied, and which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts, manual annotations and automatic language classification results can be downloaded from GitHub.
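As a rough illustration of what selecting German data from a larger Reddit dump can involve, the following minimal sketch keeps only the comments that look German. It is not the authors' method: the abstract does not specify their classifier, and the stopword heuristic, the newline-delimited JSON input format, the `body` field name, and the function names `looks_german` and `filter_german` are all assumptions made for this example.

```python
import json

# Hypothetical, deliberately tiny stopword lists (an assumption for this
# sketch); a real pipeline would use a trained language identifier.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "nicht", "ich",
                    "ein", "eine", "zu", "mit", "auch", "es", "den"}
ENGLISH_STOPWORDS = {"the", "and", "is", "not", "of", "to", "a", "in",
                     "that", "it", "with", "for", "you", "this"}


def looks_german(text, min_hits=2):
    """Return True if German stopwords are frequent enough and outnumber English ones."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    de = sum(t in GERMAN_STOPWORDS for t in tokens)
    en = sum(t in ENGLISH_STOPWORDS for t in tokens)
    return de >= min_hits and de > en


def filter_german(lines):
    """Yield parsed records whose 'body' field (assumed name) looks German."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        record = json.loads(line)
        if looks_german(record.get("body", "")):
            yield record


# Made-up sample records, one JSON object per line.
sample = [
    '{"id": "c1", "body": "Das ist nicht der richtige Ort, und ich bin auch nicht sicher."}',
    '{"id": "c2", "body": "This is not the right place, and I am not sure about it."}',
    '{"id": "c3", "body": "lol"}',
]
german_ids = [r["id"] for r in filter_german(sample)]
print(german_ids)
```

Very short comments such as the third sample are rejected because they carry too little evidence either way, which is one reason a heuristic like this is only a starting point and manual annotation remains useful for borderline cases.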
Anthology ID:
2020.lrec-1.774
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6310–6316
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.774
Cite (ACL):
Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl. 2020. A Corpus of German Reddit Exchanges (GeRedE). In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6310–6316, Marseille, France. European Language Resources Association.
Cite (Informal):
A Corpus of German Reddit Exchanges (GeRedE) (Blombach et al., LREC 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.774.pdf