Corpus REDEWIEDERGABE

Annelen Brunner, Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, Lukas Weimer


Abstract
This article presents corpus REDEWIEDERGABE, a German-language historical corpus with detailed annotations for speech, thought and writing representation (ST&WR). With approximately 490,000 tokens, it is the largest resource of its kind. It can be used to answer literary and linguistic research questions and serve as training material for machine learning. This paper describes the composition of the corpus and the annotation structure, discusses some methodological decisions and gives basic statistics about the forms of ST&WR found in this corpus.
Anthology ID:
2020.lrec-1.100
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
803–812
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.100
DOI:
Bibkey:
Cite (ACL):
Annelen Brunner, Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, and Lukas Weimer. 2020. Corpus REDEWIEDERGABE. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 803–812, Marseille, France. European Language Resources Association.
Cite (Informal):
Corpus REDEWIEDERGABE (Brunner et al., LREC 2020)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.100.pdf