Kerstin Eckart

DIRNDL is a spoken and written corpus based on German radio news, which features coreference and information-status annotation (including bridging anaphora and their antecedents), as well as prosodic information. We have recently extended DIRNDL with a fine-grained two-dimensional information status labeling scheme. We have also applied a state-of-the-art part-of-speech and morphology tagger to the corpus, as well as highly accurate constituency and dependency parsers. In light of these developments, we believe that DIRNDL is an interesting resource for NLP researchers working on automatic coreference and bridging resolution. To enable and promote usage of the data, we make it available for download in an accessible tabular format, compatible with the formats used in the CoNLL and SemEval shared tasks on automatic coreference resolution.

2012

Approximating Theoretical Linguistics Classification in Real Data: the Case of German “nach” Particle Verbs
Boris Haselbach | Kerstin Eckart | Wolfgang Seeker | Kurt Eberle | Ulrich Heid
Proceedings of COLING 2012

German nach-Particle Verbs in Semantic Theory and Corpus Data
Boris Haselbach | Wolfgang Seeker | Kerstin Eckart
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present a database-supported corpus study where we combine automatically obtained linguistic information from a statistical dependency parser, namely the occurrence of a dative argument, with predictions from a theory on the argument structure of German particle verbs with “nach”. The theory predicts five readings of “nach” which behave differently with respect to dative licensing in their argument structure. From a huge German web corpus, we extracted sentences for a subset of “nach”-particle verbs for which no dative is expected by the theory. Making use of a relational database management system, we bring together the corpus sentences and the lemmas manually annotated along the lines of the theory. We validate the theoretical predictions against the syntactic structure of the corpus sentences, which we obtained from a statistical dependency parser. We find that, in principle, the theory is borne out by the data; however, manual error analysis reveals cases for which the theory needs to be extended.

A Tool/Database Interface for Multi-Level Analyses
Kurt Eberle | Kerstin Eckart | Ulrich Heid | Boris Haselbach
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Depending on the nature of a linguistic theory, empirical investigations of its soundness may focus on corpus studies related to lexical, syntactic, semantic or other phenomena. Work in research networks, in particular, usually comprises analyses at different levels of description, where each one must be as reliable as possible when the same sentences and texts are investigated from very different perspectives. This paper describes an infrastructure that interfaces an analysis tool for multi-level annotation with a generic relational database. It supports three dimensions of analysis handling and thereby builds an integrated environment for quality assurance in corpus-based linguistic analysis: a vertical dimension relating analysis components in a pipeline, a horizontal dimension taking alternative results of the same analysis level into account, and a temporal dimension to follow up cases where analyses for the same input have been produced with different versions of a tool. As an example, we give a detailed description of a typical workflow for the vertical dimension.

2010

A Corpus Representation Format for Linguistic Web Services: The D-SPIN Text Corpus Format and its Relationship with ISO Standards
Ulrich Heid | Helmut Schmid | Kerstin Eckart | Erhard Hinrichs
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In preparing linguistic web services for corpus processing, we felt the need for a representation format that supports interoperability between different web services in a corpus processing pipeline, but also provides a well-defined interface to both legacy tools and their data formats and upcoming international standards. We present the D-SPIN text corpus format, TCF, which was designed for this purpose. It is a stand-off XML format, inspired by the philosophy of the emerging standards LAF (Linguistic Annotation Framework) and its "instances" MAF for morpho-syntactic annotation and SynAF for syntactic annotation. Tools for the exchange with existing (best practice) formats are available, and a converter from MAF to TCF is being tested in spring 2010. We describe the usage scenario where TCF is embedded and the properties and architecture of TCF. We also give examples of TCF encoded data and describe the aspects of syntactic and semantic interoperability already addressed.

Creating and Exploiting a Resource of Parallel Parses
Christian Chiarcos | Kerstin Eckart | Julia Ritz
Proceedings of the Fourth Linguistic Annotation Workshop

2008

A LAF/GrAF based Encoding Scheme for underspecified Representations of syntactic Annotations.
Manuel Kountz | Ulrich Heid | Kerstin Eckart
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Data models and encoding formats for syntactically annotated text corpora need to deal with syntactic ambiguity; underspecified representations are particularly well suited for the representation of ambiguous data because they allow for high informational efficiency. We discuss the issue of being informationally efficient, and the trade-off between efficient encoding of linguistic annotations and complete documentation of linguistic analyses. The main topic of this article is a data model and an encoding scheme based on LAF/GrAF (Ide and Romary, 2006; Ide and Suderman, 2007) which provides a flexible framework for encoding underspecified representations. We show how a set of dependency structures and a set of TiGer graphs (Brants et al., 2002) representing the readings of an ambiguous sentence can be encoded, and we discuss basic issues in querying corpora which are encoded using the framework presented here.