Sarah Jablotschkin


2025

Coreference in simplified German: Linguistic features and challenges of automatic annotation
Sarah Jablotschkin | Ekaterina Lapshinova-Koltunski | Heike Zinsmeister
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

In this paper, we analyse coreference annotation in German, focussing on the phenomenon of simplification, that is, the tendency to use words and constructions that are assumed to be more easily perceived, understood, or produced. Simplification is one of the strategies language users employ to make communication more effective. We are interested in how simplification is reflected in coreference in two language products subject to it: simultaneous interpreting and Easy German. To this end, we automatically annotate simplified texts with coreference and evaluate the outputs of the automatic annotation. In addition, we look into quantitative distributions of selected coreference features. Our findings show that although the two language products diverge in the factors driving simplification, they share some specific coreference features. We also show that this specificity may cause annotation errors in simplified language, e.g. with non-nominal or split antecedents.

Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)
Sarah Jablotschkin | Sandra Kübler | Heike Zinsmeister
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)

2024

DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis
Sarah Jablotschkin | Elke Teich | Heike Zinsmeister
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been required to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, apply them consistently, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step towards gaining insights into these issues and furthering LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts comprising Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and for comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctiveness of the two language variants.
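
For orientation, the corpus comparison can be sketched as computing the Kullback-Leibler divergence between the two n-gram models; the notation below is illustrative only and not taken from the paper, with P standing for a (hypothetical) Easy German n-gram model, Q for a comparable Standard German model, and the sum running over n-grams w observed in both:

D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{w} P(w) \,\log_2 \frac{P(w)}{Q(w)}

Under this reading, n-grams that are much more probable in Easy German than in Standard German contribute most to the divergence and thus point to its typical features.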

Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)
Daniel Dakota | Sarah Jablotschkin | Sandra Kübler | Heike Zinsmeister
Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)