Julia Rakers


2022

pdf bib
How GermaParl Evolves: Improving Data Quality by Reproducible Corpus Preparation and User Involvement
Andreas Blaette | Julia Rakers | Christoph Leonhardt
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

The development and curation of large-scale corpora of plenary debates requires not only care and attention to detail when the data is created but also effective means of sustainable quality control. This paper makes two contributions: Firstly, it presents an updated version of the GermaParl corpus of parliamentary debates in the German *Bundestag*. Secondly, it shows how the corpus preparation pipeline is designed to serve the quality of the resource by facilitating effective community involvement. Centered around a workflow which combines reproducibility, transparency and version control, the pipeline allows for continuous improvements to the corpus.