How GermaParl Evolves: Improving Data Quality by Reproducible Corpus Preparation and User Involvement

Andreas Blaette, Julia Rakers, Christoph Leonhardt


Abstract
The development and curation of large-scale corpora of plenary debates require not only care and attention to detail when the data is created but also effective means of sustainable quality control. This paper makes two contributions. First, it presents an updated version of the GermaParl corpus of parliamentary debates in the German *Bundestag*. Second, it shows how the corpus preparation pipeline is designed to serve the quality of the resource by facilitating effective community involvement. Centered on a workflow that combines reproducibility, transparency and version control, the pipeline allows for continuous improvements to the corpus.
Anthology ID:
2022.parlaclarin-1.2
Volume:
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Darja Fišer, Maria Eskevich, Jakob Lenardič, Franciska de Jong
Venue:
ParlaCLARIN
Publisher:
European Language Resources Association
Pages:
7–15
URL:
https://aclanthology.org/2022.parlaclarin-1.2
Cite (ACL):
Andreas Blaette, Julia Rakers, and Christoph Leonhardt. 2022. How GermaParl Evolves: Improving Data Quality by Reproducible Corpus Preparation and User Involvement. In Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pages 7–15, Marseille, France. European Language Resources Association.
Cite (Informal):
How GermaParl Evolves: Improving Data Quality by Reproducible Corpus Preparation and User Involvement (Blaette et al., ParlaCLARIN 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2022.parlaclarin-1.2.pdf