A Large Annotated Reference Corpus of New High German Poetry

Thomas Haider


Abstract
This paper introduces a large annotated corpus of public domain German poetry, comprising 65k poems from the period 1600 to the 1920s. We describe how the corpus was compiled and cleaned (including duplicate detection), and characterize its current state in terms of size, format, temporal distribution, and automatic annotation. Besides metadata, the corpus contains reliable annotation of tokens, syllables, parts of speech, meter, and verse measure. Finally, we give some statistics on the annotation and an overview of other poetry corpora.
Anthology ID:
2024.lrec-main.59
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
677–683
URL:
https://aclanthology.org/2024.lrec-main.59
Cite (ACL):
Thomas Haider. 2024. A Large Annotated Reference Corpus of New High German Poetry. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 677–683, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Large Annotated Reference Corpus of New High German Poetry (Haider, LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.lrec-main.59.pdf