Abstract
We present a new corpus of German tweets. Due to the relatively small number of German messages on Twitter, it is possible to collect a virtually complete snapshot of German Twitter messages over a period of time. In this paper, we present our collection method, which produced a corpus of 24 million tweets representing a large majority of all German tweets sent in April 2013. Further, we analyze this representative data set and characterize the German twitterverse. While German Twitter data is similar to other Twitter data in its temporal distribution, German Twitter users are much more reluctant to share geolocation information with their tweets. Finally, the corpus collection method allows for a study of discourse phenomena in the Twitter data, structured into discussion threads.
- Anthology ID:
- L14-1101
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2284–2289
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/1146_Paper.pdf
- Cite (ACL):
- Tatjana Scheffler. 2014. A German Twitter Snapshot. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2284–2289, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- A German Twitter Snapshot (Scheffler, LREC 2014)