Frank Puppe

2026

Human-in-the-Loop Mass Transcription and Ground Truth Annotation for Challenging Historical Documents
Norbert Fischer | Frank Puppe
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Challenging historical documents still pose significant difficulties for fully automatic layout detection and text recognition, requiring lengthy, demanding correction. We describe our experiences with complex layouts and present our workflow with AdaptOCR, a web-based annotation tool designed to facilitate the efficient transcription and ground-truth annotation of demanding historical documents. Addressing the limitations of existing solutions, AdaptOCR prioritizes a streamlined workflow with an integrated trainable layout and OCR pipeline. The tool uses the PAGE standard to represent document structure and enables the annotation of baselines, regions, text lines and the correction of their transcriptions providing automatic OCR invocation and dictionary-based error detection. Furthermore, it supports flexible annotations with custom element types and attributes to cater to different project requirements. We demonstrate the effectiveness of the workflow and tool in two demanding applications: The transcription of a large corpus of historical printings and the detection / annotation of handwritten artifacts within the private library of the Grimm brothers. In addition, we evaluate the dictionary-based correction and assess the efficiency improvements using AdaptOCR in a pilot study.

2024

pdf bib abs

ADEA: An Argumentative Dialogue Dataset on Ethical Issues Concerning Future A.I. Applications
Christian Hauptmann | Adrian Krenzer | Antonia Krause | Frank Puppe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Introducing ADEA: a German dataset that captures online dialogues and focuses on ethical issues related to future AI applications. This dataset, which includes over 2800 labeled user utterances on four different topics, is specifically designed for the training of chatbots that can navigate the complexities of real-world ethical AI conversations. The creation of these dialogues is the result of two carefully conducted studies in which university students interacted with an argumentative dialogue system. A fundamental part of our methodology is the use of German argument graphs. These graphs not only form the knowledge base of the dialogue system but also serve as an effective annotation scheme for the dialogues. Apart from the introduction of the dataset and the argument graphs, we provide a preliminary benchmark using GPT-4 via the OpenAI API. This provides researchers with a concrete reference point while demonstrating the potential of our dataset. We make our dataset and argument graphs available at https://github.com/HaupChris/ADEA-Dialogue-Dataset.

2021

pdf bib abs

The FairyNet Corpus - Character Networks for German Fairy Tales
David Schmidt | Albin Zehe | Janne Lorenzen | Lisa Sergel | Sebastian Düker | Markus Krug | Frank Puppe
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents a data set of German fairy tales, manually annotated with character networks which were obtained with high inter rater agreement. The release of this corpus provides an opportunity of training and comparing different algorithms for the extraction of character networks, which so far was barely possible due to heterogeneous interests of previous researchers. We demonstrate the usefulness of our data set by providing baseline experiments for the automatic extraction of character networks, applying a rule-based pipeline as well as a neural approach, and find the neural approach outperforming the rule-approach in most evaluation settings.

pdf bib abs

This paper introduces the novel task of scene segmentation on narrative texts and provides an annotated corpus, a discussion of the linguistic and narrative properties of the task and baseline experiments towards automatic solutions. A scene here is a segment of the text where time and discourse time are more or less equal, the narration focuses on one action and location and character constellations stay the same. The corpus we describe consists of German-language dime novels (550k tokens) that have been annotated in parallel, achieving an inter-annotator agreement of gamma = 0.7. Baseline experiments using BERT achieve an F1 score of 24%, showing that the task is very challenging. An automatic scene segmentation paves the way towards processing longer narrative texts like tales or novels by breaking them down into smaller, coherent and meaningful parts, which is an important stepping stone towards the reconstruction of plot in Computational Literary Studies but also can serve to improve tasks like coreference resolution.

Frank Puppe

2026

2024

2021

2015

2014

Co-authors

Venues