Petr Pajas

2012

We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high-level overview of the underlying linguistic theory (the so-called tectogrammatical annotation), with some details of its most important features, such as valency annotation, ellipsis reconstruction, and coreference.

2010

Querying Diverse Treebanks in a Uniform Way
Jan Štěpánek | Petr Pajas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents a system for querying treebanks in a uniform way. The system is able to work with both dependency-based and constituency-based treebanks in any language. We demonstrate its abilities on 11 different treebanks. The query language used by the system provides many features not available in other existing systems while still performing efficiently. The paper also describes the conversion of ten treebanks into a common XML-based format used by the system, touching on the question of standards and formats. The paper then shows several examples of linguistically interesting questions that the system is able to answer, for example browsing verbal clauses without subjects or extraposed relative clauses, generating the underlying grammar in a constituency treebank, searching for non-projective edges in a dependency treebank, or determining the word-order typology of a language based on the treebank. The performance of several implementations of the system is also discussed by measuring the time requirements of some of the queries.

Annotation Tool for Extended Textual Coreference and Bridging Anaphora
Jiří Mírovský | Petr Pajas | Anna Nedoluzhko
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present an annotation tool for extended textual coreference and bridging anaphora in the Prague Dependency Treebank 2.0 (PDT 2.0). After a very brief description of the annotation scheme, we focus on the technical details of the annotation process. We describe several features of the tool that help the annotators, such as the possibility of combining surface and deep syntactic representations of sentences during annotation, automatic maintenance of the coreferential chain, underlining of candidate antecedents, etc. For studying differences among parallel annotations, the tool can display several annotations of the same data simultaneously. The annotation tool can be used for other corpora too, as long as they have been transformed to the PML format. We present modifications of the tool for working with coreference relations on other layers of language description, namely the analytical layer and the morphological layer of PDT.

2009

The Coding Scheme for Annotating Extended Nominal Coreference and Bridging Anaphora in the Prague Dependency Treebank
Anna Nedoluzhko | Jiří Mírovský | Petr Pajas
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

System for Querying Syntactically Annotated Corpora
Petr Pajas | Jan Štěpánek
Proceedings of the ACL-IJCNLP 2009 Software Demonstrations

2008

Recent Advances in a Feature-Rich Framework for Treebank Annotation
Petr Pajas | Jan Štěpánek
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

TectoMT: Highly Modular MT System with Tectogrammatics Used as Transfer Layer
Zdeněk Žabokrtský | Jan Ptáček | Petr Pajas
Proceedings of the Third Workshop on Statistical Machine Translation

2006

The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30,000 words. Our approach to annotation is based on the Prague Dependency Treebank, which serves as an excellent model due to the similarity of the languages, the existence of a detailed annotation guide and an annotation editor. The initial treebank contains a portion of the MULTEXT-East parallel word-level annotated corpus, namely the first part of the Slovene translation of Orwell's 1984. This corpus was first parsed automatically, to arrive at the initial analytic level dependency trees. These were then hand corrected using the tree editor TrEd; simultaneously, the Czech annotation manual was modified for Slovene. The current version is available in XML/TEI, as well as derived formats, and has been used in a comparative evaluation using the MALT parser, and as one of the languages present in the CoNLL-X shared task on dependency parsing. The paper also discusses further work, in the first instance the composition of the corpus to be annotated next.