CodE Alltag: A German-Language E-Mail Corpus
Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling, Udo Hahn
Abstract
We introduce CODE ALLTAG, a text corpus composed of German-language e-mails. It is divided into two partitions: the first, CODE ALLTAG_XL, is a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second, CODE ALLTAG_S+d, is much smaller (fewer than a thousand e-mails) but comes with demographic data for the author of each e-mail. CODE ALLTAG thus currently constitutes the largest e-mail corpus ever built. In this paper, we describe, for both parts, the solicitation process for gathering e-mails, present descriptive statistical properties of the corpus, and, for CODE ALLTAG_S+d, report a compilation of demographic features of the e-mail donors.
 - Anthology ID: L16-1404
 - Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
 - Month: May
 - Year: 2016
 - Address: Portorož, Slovenia
 - Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
 - Venue: LREC
 - Publisher: European Language Resources Association (ELRA)
 - Pages: 2543–2550
 - URL: https://preview.aclanthology.org/landing_page/L16-1404/
 - Cite (ACL): Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling, and Udo Hahn. 2016. CodE Alltag: A German-Language E-Mail Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2543–2550, Portorož, Slovenia. European Language Resources Association (ELRA).
 - Cite (Informal): CodE Alltag: A German-Language E-Mail Corpus (Krieg-Holz et al., LREC 2016)
 - PDF: https://preview.aclanthology.org/landing_page/L16-1404.pdf