Heike Zinsmeister

2024

pdf abs
DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis
Sarah Jablotschkin | Elke Teich | Heike Zinsmeister
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.

2023

pdf abs
Personal noun detection for German
Carla Sökefeld | Melanie Andresen | Johanna Binnewitt | Heike Zinsmeister
Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19)

Personal nouns, i.e. common nouns denoting human beings, play an important role in manifesting gender and gender stereotypes in texts, especially for languages with grammatical gender like German. Automatically detecting and extracting personal nouns can thus be of interest to a myriad of different tasks such as minimizing gender bias in language models and researching gender stereotypes or gender-fair language, but is complicated by the morphological heterogeneity and homonymy of personal and non-personal nouns, which restrict lexicon-based approaches. In this paper, we introduce a classifier created by fine-tuning a transformer model that detects personal nouns in German. Although some phenomena like homonymy and metalinguistic uses are still problematic, the model is able to classify personal nouns with robust accuracy (f1-score: 0.94).

2021

pdf bib
The Impact of Word Embeddings on Neural Dependency Parsing
Benedikt Adelmann | Wolfgang Menzel | Heike Zinsmeister
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2020

pdf abs
Modeling Ambiguity with Many Annotators and Self-Assessments of Annotator Certainty
Melanie Andresen | Michael Vauth | Heike Zinsmeister
Proceedings of the 14th Linguistic Annotation Workshop

Most annotation efforts assume that annotators will agree on labels, if the annotation categories are well-defined and documented in annotation guidelines. However, this is not always true. For instance, content-related questions such as ‘Is this sentence about topic X?’ are unlikely to elicit the same answer from all annotators. Additional specifications in the guidelines are helpful to some extent, but can soon get overspecified by rules that cannot be justified by a research question. In this study, we model the semantic category ‘illness’ and its use in a gradual way. For this purpose, we (i) ask many annotators (30 votes per item, 960 items) for their opinion in a crowdsourcing experiment, (ii) ask annotators to indicate their certainty with respect to their annotation, and (iii) compare this across two different text types. We show that results of multiple annotations and average annotator certainty correlate, but many ambiguities can only be captured if several people contribute. The annotated data allow us to filter for sentences with high or low agreement and analyze causes of disagreement, thus getting a better understanding of people’s perception of illness—as an example of a semantic category—as well as of the content of our annotated texts.

2018

pdf abs
Survey: Anaphora With Non-nominal Antecedents in Computational Linguistics: a Survey
Varada Kolhatkar | Adam Roussel | Stefanie Dipper | Heike Zinsmeister
Computational Linguistics, Volume 44, Issue 3 - September 2018

This article provides an extensive overview of the literature related to the phenomenon of non-nominal-antecedent anaphora (also known as abstract anaphora or discourse deixis), a type of anaphora in which an anaphor like “that” refers to an antecedent (marked in boldface) that is syntactically non-nominal, such as the first sentence in “It’s way too hot here. That’s why I’m moving to Alaska.” Annotating and automatically resolving these cases of anaphora is interesting in its own right because of the complexities involved in identifying non-nominal antecedents, which typically represent abstract objects such as events, facts, and propositions. There is also practical value in the resolution of non-nominal-antecedent anaphora, as this would help computational systems in machine translation, summarization, and question answering, as well as, conceivably, any other task dependent on some measure of text understanding. Most of the existing approaches to anaphora annotation and resolution focus on nominal-antecedent anaphora, classifying many of the cases where the antecedents are syntactically non-nominal as non-anaphoric. There has been some work done on this topic, but it remains scattered and difficult to collect and assess. With this article, we hope to bring together and synthesize work done in disparate contexts up to now in order to identify fundamental problems and draw conclusions from an overarching perspective. Having a good picture of the current state of the art in this field can help researchers direct their efforts to where they are most necessary. Because of the great variety of theoretical approaches that have been brought to bear on the problem, there is an equally diverse array of terminologies that are used to describe it, so we will provide an overview and discussion of these terminologies. We also describe the linguistic properties of non-nominal-antecedent anaphora, examine previous annotation efforts that have addressed this topic, and present the computational approaches that aim at resolving non-nominal-antecedent anaphora automatically. We close with a review of the remaining open questions in this area and some of our recommendations for future research.

The ARRAU corpus is an anaphorically annotated corpus of English providing rich linguistic information about anaphora resolution. The most distinctive feature of the corpus is the annotation of a wide range of anaphoric relations, including bridging references and discourse deixis in addition to identity (coreference). Other distinctive features include treating all NPs as markables, including non-referring NPs; and the annotation of a variety of morphosyntactic and semantic mention and entity attributes, including the genericity status of the entities referred to by markables. The corpus however has not been extensively used for anaphora resolution research so far. In this paper, we discuss three datasets extracted from the ARRAU corpus to support the three subtasks of the CRAC 2018 Shared Task–identity anaphora resolution over ARRAU-style markables, bridging references resolution, and discourse deixis; the evaluation scripts assessing system performance on those datasets; and preliminary results on these three tasks that may serve as baseline for subsequent research in these phenomena.

2017

pdf abs
Approximating Style by N-gram-based Annotation
Melanie Andresen | Heike Zinsmeister
Proceedings of the Workshop on Stylistic Variation

The concept of style is much debated in theoretical as well as empirical terms. From an empirical perspective, the key question is how to operationalize style and thus make it accessible for annotation and quantification. In authorship attribution, many different approaches have successfully resolved this issue at the cost of linguistic interpretability: The resulting algorithms may be able to distinguish one language variety from the other, but do not give us much information on their distinctive linguistic properties. We approach the issue of interpreting stylistic features by extracting linear and syntactic n-grams that are distinctive for a language variety. We present a study that exemplifies this process by a comparison of the German academic languages of linguistics and literary studies. Overall, our findings show that distinctive n-grams can be related to linguistic categories. The results suggest that the style of German literary studies is characterized by nominal structures and the style of linguistics by verbal ones.

pdf
The Benefit of Syntactic vs. Linear N-grams for Linguistic Description
Melanie Andresen | Heike Zinsmeister
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

The Stuttgart-Tübingen TagSet (STTS) is a de-facto standard for the part-of-speech tagging of German texts. Since its first publication in 1995, STTS has been used in a variety of annotation projects, some of which have adapted the tagset slightly for their specific needs. Recently, the focus of many projects has shifted from the analysis of newspaper text to that of non-standard varieties such as user-generated content, historical texts, and learner language. These text types contain linguistic phenomena that are missing from or are only suboptimally covered by STTS; in a community effort, German NLP researchers have therefore proposed additions to and modifications of the tagset that will handle these phenomena more appropriately. In addition, they have discussed alternative ways of tag assignment in terms of bipartite tags (stem, token) for historical texts and tripartite tags (lexicon, morphology, distribution) for learner texts. In this article, we report on this ongoing activity, addressing methodological issues and discussing selected phenomena and their treatment in the tagset adaptation process.

pdf
Annotating descriptively incomplete language phenomena
Fabian Barteld | Sarah Ihden | Ingrid Schröder | Heike Zinsmeister
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2013

pdf
Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data
Varada Kolhatkar | Heike Zinsmeister | Graeme Hirst
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf
Annotating Anaphoric Shell Nouns with their Antecedents
Varada Kolhatkar | Heike Zinsmeister | Graeme Hirst
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

pdf abs
The Use of Parallel and Comparable Data for Analysis of Abstract Anaphora in German and English
Stefanie Dipper | Melanie Seiss | Heike Zinsmeister
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Parallel corpora ― original texts aligned with their translations ― are a widely used resource in computational linguistics. Translation studies have shown that translated texts often differ systematically from comparable original texts. Translators tend to be faithful to structures of the original texts, resulting in a """"shining through"""" of the original language preferences in the translated text. Translators also tend to make their translations most comprehensible with the effect that translated texts can be more explicit than their source texts. Motivated by the need to use a parallel resource for cross-linguistic feature induction in abstract anaphora resolution, this paper investigates properties of English and German texts in the Europarl corpus, taking into account both general features such as sentence length as well as task-dependent features such as the distribution of demonstrative noun phrases. The investigation is based on the entire Europarl corpus as well as on a small subset thereof, which has been manually annotated. The results indicate English translated texts are sufficiently """"authentic"""" to be used as training data for anaphora resolution; results for German texts are less conclusive, though.