This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
KerstinEckart
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
DIRNDL is a spoken and written corpus based on German radio news, which features coreference and information-status annotation (including bridging anaphora and their antecedents), as well as prosodic information. We have recently extended DIRNDL with a fine-grained two-dimensional information status labeling scheme. We have also applied a state-of-the-art part-of-speech and morphology tagger to the corpus, as well as highly accurate constituency and dependency parsers. In the light of this development we believe that DIRNDL is an interesting resource for NLP researchers working on automatic coreference and bridging resolution. In order to enable and promote usage of the data, we make it available for download in an accessible tabular format, compatible with the formats used in the CoNLL and SemEval shared tasks on automatic coreference resolution.
In this paper, we present a database-supported corpus study where we combine automatically obtained linguistic information from a statistical dependency parser, namely the occurrence of a dative argument, with predictions from a theory on the argument structure of German particle verbs with """"nach"""". The theory predicts five readings of """"nach"""" which behave differently with respect to dative licensing in their argument structure. From a huge German web corpus, we extracted sentences for a subset of """"nach""""-particle verbs for which no dative is expected by the theory. Making use of a relational database management system, we bring together the corpus sentences and the lemmas manually annotated along the lines of the theory. We validate the theoretical predictions against the syntactic structure of the corpus sentences, which we obtained from a statistical dependency parser. We find that, in principle, the theory is borne out by the data, however, manual error analysis reveals cases for which the theory needs to be extended.
Depending on the nature of a linguistic theory, empirical investigations of its soundness may focus on corpus studies related to lexical, syntactic, semantic or other phenomena. Especially work in research networks usually comprises analyses of different levels of description, where each one must be as reliable as possible when the same sentences and texts are investigated under very different perspectives. This paper describes an infrastructure that interfaces an analysis tool for multi-level annotation with a generic relational database. It supports three dimensions of analysis-handling and thereby builds an integrated environment for quality assurance in corpus based linguistic analysis: a vertical dimension relating analysis components in a pipeline, a horizontal dimension taking alternative results of the same analysis level into account and a temporal dimension to follow up cases where analyses for the same input have been produced with different versions of a tool. As an example we give a detailed description of a typical workflow for the vertical dimension.
In the framework of the preparation of linguistic web services for corpus processing, the need for a representation format was felt, which supports interoperability between different web services in a corpus processing pipeline, but also provides a well-defined interface to both, legacy tools and their data formats and upcoming international standards. We present the D-SPIN text corpus format, TCF, which was designed for this purpose. It is a stand-off XML format, inspired by the philosophy of the emerging standards LAF (Linguistic Annotation Framework) and its ``instances'' MAF for morpho-syntactic annotation and SynAF for syntactic annotation. Tools for the exchange with existing (best practice) formats are available, and a converter from MAF to TCF is being tested in spring 2010. We describe the usage scenario where TCF is embedded and the properties and architecture of TCF. We also give examples of TCF encoded data and describe the aspects of syntactic and semantic interoperability already addressed.
Data models and encoding formats for syntactically annotated text corpora need to deal with syntactic ambiguity; underspecified representations are particularly well suited for the representation of ambiguous data because they allow for high informational efficiency. We discuss the issue of being informationally efficient, and the trade-off between efficient encoding of linguistic annotations and complete documentation of linguistic analyses. The main topic of this article is a data model and an encoding scheme based on LAF/GrAF (Ide and Romary, 2006; Ide and Suderman, 2007) which provides a flexible framework for encoding underspecified representations. We show how a set of dependency structures and a set of TiGer graphs (Brants et al., 2002) representing the readings of an ambiguous sentence can be encoded, and we discuss basic issues in querying corpora which are encoded using the framework presented here.