This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
PetrPajas
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high level overview of the underlying linguistic theory (the so-called tectogrammatical annotation) with some details of the most important features like valency annotation, ellipsis reconstruction or coreference.
This paper presents a system for querying treebanks in a uniform way. The system is able to work with both dependency and constituency based treebanks in any language. We demonstrate its abilities on 11 different treebanks. The query language used by the system provides many features not available in other existing systems while still keeping the performance efficient. The paper also describes the conversion of ten treebanks into a common XML-based format used by the system, touching the question of standards and formats. The paper then shows several examples of linguistically interesting questions that the system is able to answer, for example browsing verbal clauses without subjects or extraposed relative clauses, generating the underlying grammar in a constituency treebank, searching for non-projective edges in a dependency treebank, or word-order typology of a language based on the treebank. The performance of several implementations of the system is also discussed by measuring the time requirements of some of the queries.
We present an annotation tool for the extended textual coreference and the bridging anaphora in the Prague Dependency Treebank 2.0 (PDT 2.0). After we very briefly describe the annotation scheme, we focus on details of the annotation process from the technical point of view. We present the way of helping the annotators by several useful features implemented in the annotation tool, such as a possibility to combine surface and deep syntactic representation of sentences during the annotation, an automatic maintaining of the coreferential chain, underlining candidates for antecedents, etc. For studying differences among parallel annotations, the tool offers a simultaneous depicting of several annotations of the same data. The annotation tool can be used for other corpora too, as long as they have been transformed to the PML format. We present modifications of the tool for working with the coreference relations on other layers of language description, namely on the analytical layer and the morphological layer of PDT.
The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30.000 words. Ourapproach to annotation is based on the Prague Dependency Treebank, which serves as an excellent model due to the similarity of the languages, the existence of a detailed annotation guide and an annotation editor. The initial treebank contains a portion of theMULTEXT-East parallel word-level annotated corpus, namely the firstpart of the Slovene translation of Orwell's 1984. This corpus was first parsed automatically, to arrive at the initial analytic level dependency trees. These were then hand corrected using the tree editorTrEd; simultaneously, the Czech annotation manual was modified forSlovene. The current version is available in XML/TEI, as well asderived formats, and has been used in a comparative evaluation using the MALT parser, and as one of the languages present in the CoNLL-Xshared task on dependency parsing. The paper also discusses further work, in the first instance the composition of the corpus to be annotated next.