Jiří Mírovský


2024

Announcing the Prague Discourse Treebank 3.0
Pavlína Synková | Jiří Mírovský | Lucie Poláková | Magdaléna Rysová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the Prague Discourse Treebank 3.0 – a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared to the previous version (PDiT 2.0), version 3.0 comes with three types of major updates: (i) it brings a largely revised annotation of discourse relations: pragmatic relations have been thoroughly reworked, many inconsistencies across all discourse types have been fixed and previously unclear cases marked in annotators’ comments have been resolved, (ii) it achieves consistency with the Lexicon of Czech Discourse Connectives (CzeDLex), and (iii) it provides the data not only in its native format (Prague Markup Language, discourse relations annotated on top of the dependency trees), but also in the Penn Discourse Treebank 3.0 format (plain text plus a stand-off discourse annotation) and sense taxonomy. PDiT 3.0 contains 21,662 discourse relations (plus 445 list relations) in 49 thousand sentences.

Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank
Jiří Mírovský | Pavlína Synková | Lucie Poláková | Marie Paclíková
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present a cost-effective method for obtaining a high-quality annotation of explicit discourse relations in the Czech part of the Prague Czech–English Dependency Treebank, a corpus of almost 50 thousand sentences coming from the Czech translation of the Wall Street Journal part of the Penn Treebank. We use three different sources of information and combine them to obtain the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) manual tectogrammatical (deep syntax) representation of sentences of the corpus, and (iii) the Lexicon of Czech Discourse Connectives CzeDLex. After solving as many discrepancies as possible automatically, the final discourse annotation is achieved by manual inspection of the remaining problematic cases. The discourse annotation of the corpus will be available both in the Prague format (on top of tectogrammatical trees) with the Prague taxonomy of discourse types, and in the Penn format (on plain texts) with the Penn Discourse Treebank 3.0 sense taxonomy.

Developing a Rhetorical Structure Theory Treebank for Czech
Lucie Poláková | Jiří Mírovský | Šárka Zikánová | Eva Hajičová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce the first version of the Czech RST Discourse Treebank, a collection of Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST), a global coherence model proposed by Mann and Thompson (1988). Each document in the corpus is represented as a single tree-like structure, where discourse units are interconnected through hierarchical rhetorical relations and their relative importance for the main purpose of a text is modeled by the nuclearity principle. The treebank is freely available in the LINDAT/CLARIAH-CZ repository under the Creative Commons license; for some documents, it includes two gold annotations representing divergent yet relevant interpretations. The paper outlines the annotation process, provides corpus statistics and evaluation, and discusses the issue of consistency associated with the global level of textual interpretation. In general, good agreement on the structure and labeling could be achieved on the lowest, local tree level and on the identification of the most central (nuclear) elementary discourse units. Disagreements mostly concerned segmentation and, in the structure, differences in the stepwise process of linking the largest text blocks. The project contributes to the advancement of RST research and its application to real-world text analysis challenges.

2022

Advantages of a Complex Multilayer Annotation Scheme: The Case of the Prague Dependency Treebank
Eva Hajičová | Marie Mikulová | Barbora Štěpánková | Jiří Mírovský
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022

Recently, many corpora have been developed that contain multiple annotations of various linguistic phenomena, from morphological categories of words through the syntactic structure of sentences to discourse and coreference relations in texts. Discussions are ongoing on an appropriate annotation scheme for a large amount of diverse information. In our contribution we express our conviction that a multilayer annotation scheme makes it possible to view the language system in its complexity and in the interaction of individual phenomena, and that at least two aspects support such a scheme: (i) A multilayer annotation scheme makes it possible to use the annotation of one layer to design the annotation of another layer or layers, both conceptually and in the form of a pre-annotation procedure or annotation checking rules. (ii) A multilayer annotation scheme presents a reliable ground for corpus studies based on features across the layers. These aspects are demonstrated on the case of the Prague Dependency Treebank. Its multilayer annotation scheme has withstood the test of time and also serves well for complex textual annotations, in which earlier morpho-syntactic annotations are advantageously used. In addition to a reference to the previous projects that utilise its annotation scheme, we present several current investigations.

Annotating Attribution in Czech News Server Articles
Barbora Hladká | Jiří Mírovský | Matyáš Kopp | Václav Moravec
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper focuses on detection of sources in the Czech articles published on a news server of Czech public radio. In particular, we search for attribution in sentences and we recognize attributed sources and their sentence context (signals). We organized a crowdsourcing annotation task that resulted in a data set of 2,167 stories with manually recognized signals and sources. In addition, the sources were classified into the classes of named and unnamed sources.

2020

GeCzLex: Lexicon of Czech and German Anaphoric Connectives
Lucie Poláková | Kateřina Rysová | Magdaléna Rysová | Jiří Mírovský
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce the first version of GeCzLex, an online electronic resource for translation equivalents of Czech and German discourse connectives. The lexicon is one of the outcomes of the research on anaphoricity and long-distance relations in discourse; at present it contains anaphoric connectives (ACs) for Czech and German, and further their possible translations documented in bilingual parallel corpora (not necessarily anaphoric). As a basis, we use two existing monolingual lexicons of connectives, the Lexicon of Czech Discourse Connectives (CzeDLex) and the Lexicon of Discourse Markers (DiMLex) for German, and interlink their relevant entries via semantic annotation of the connectives (according to the PDTB 3 sense taxonomy) and statistical information of translation possibilities from the Czech and German parallel data of the InterCorp project. The lexicon is, as far as we know, the first bilingual inventory of connectives with linkage on the level of individual entries, and a first attempt to systematically describe devices engaged in long-distance, non-local discourse coherence. The lexicon is freely available under the Creative Commons License.

CzeDLex 0.6 and its Representation in the PML-TQ
Jiří Mírovský | Lucie Poláková | Pavlína Synková
Proceedings of the Twelfth Language Resources and Evaluation Conference

CzeDLex is an electronic lexicon of Czech discourse connectives with its data coming from a large treebank annotated with discourse relations. Its new version CzeDLex 0.6 (as compared with the previous version 0.5, which was published in 2017) is significantly larger with respect to manually processed entries. Also, its structure has been modified to allow for primary connectives to appear with multiple entries for a single discourse sense. The lexicon comes in several formats, being both human and machine readable, and is available for searching in PML Tree Query, a user-friendly and powerful search tool for all kinds of linguistically annotated treebanks. The main purpose of this paper/demo is to present the new version of the lexicon and to demonstrate possibilities of mining various types of information from the lexicon using PML Tree Query; we present several examples of search queries over the lexicon data along with their results. The new version of the lexicon, CzeDLex 0.6, is available on-line and was officially released in December 2019 under the Creative Commons License.

2019

Ordering of Adverbials of Time and Place in Grammars and in an Annotated English-Czech Parallel Corpus
Eva Hajičová | Jiří Mírovský | Kateřina Rysová
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

Discourse Coherence Through the Lens of an Annotated Text Corpus: A Case Study
Eva Hajičová | Jiří Mírovský
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

EvalD Reference-Less Discourse Evaluation for WMT18
Ondřej Bojar | Jiří Mírovský | Kateřina Rysová | Magdaléna Rysová
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present the results of automatic evaluation of discourse in machine translation (MT) outputs using the EVALD tool. EVALD was originally designed and trained to assess the quality of human writing, for native speakers and foreign-language learners. MT has seen a tremendous leap in translation quality at the level of sentences and it is thus interesting to see if the human-level evaluation is becoming relevant.

2017

Introducing EVALD – Software Applications for Automatic Evaluation of Discourse in Czech
Kateřina Rysová | Magdaléna Rysová | Jiří Mírovský | Michal Novák
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the paper, we introduce two software applications for automatic evaluation of coherence in Czech texts called EVALD – Evaluator of Discourse. The first one – EVALD 1.0 – evaluates texts written by native speakers of Czech on a five-step scale commonly used at Czech schools (grade 1 is the best, grade 5 is the worst). The second application, EVALD 1.0 for Foreigners, assesses texts by non-native speakers of Czech using a six-step scale (A1–C2) according to CEFR. Both applications are available online at https://lindat.mff.cuni.cz/services/evald-foreign/.

Extracting a Lexicon of Discourse Connectives in Czech from an Annotated Corpus
Pavlína Synková | Magdaléna Rysová | Lucie Poláková | Jiří Mírovský
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

2016

Coreference in Prague Czech-English Dependency Treebank
Anna Nedoluzhko | Michal Novák | Silvie Cinková | Marie Mikulová | Jiří Mírovský
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present coreference annotation on parallel Czech-English texts of the Prague Czech-English Dependency Treebank (PCEDT). The paper describes innovations made to PCEDT 2.0 concerning coreference, as well as coreference information already present there. We characterize the coreference annotation scheme, give the statistics and compare our annotation with the coreference annotation in OntoNotes and the Prague Dependency Treebank for Czech. We also present the experiments made using this corpus to improve the alignment of coreferential expressions, which helps us to collect better statistics of correspondences between types of coreferential relations in Czech and English. The corpus released as PCEDT 2.0 Coref is publicly available.

Searching in the Penn Discourse Treebank Using the PML-Tree Query
Jiří Mírovský | Lucie Poláková | Jan Štěpánek
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The PML-Tree Query is a general, powerful and user-friendly system for querying richly linguistically annotated treebanks. The paper shows how the PML-Tree Query can be used for searching for discourse relations in the Penn Discourse Treebank 2.0 mapped onto the syntactic annotation of the Penn Treebank.

Automatic evaluation of surface coherence in L2 texts in Czech
Kateřina Rysová | Magdaléna Rysová | Jiří Mírovský
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016)

Designing CzeDLex – A Lexicon of Czech Discourse Connectives
Jiří Mírovský | Pavlína Jínová | Magdaléna Rysová | Lucie Poláková
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

2014

Discourse Relations in the Prague Dependency Treebank 3.0
Jiří Mírovský | Pavlína Jínová | Lucie Poláková
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

Genres in the Prague Discourse Treebank
Lucie Poláková | Pavlína Jínová | Jiří Mírovský
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the project of classification of Prague Discourse Treebank documents (Czech journalistic texts) for their genres. Our main interest lies in opening the possibility to observe how text coherence is realized in different types (in the genre sense) of language data and, in the future, in exploring the ways of using genres as a feature for multi-sentence-level language technologies. In the paper, we first describe the motivation and the concept of the genre annotation, and briefly introduce the Prague Discourse Treebank. Then, we elaborate on the process of manual annotation of genres in the treebank, from the annotators’ manual work to post-annotation checks and to the inter-annotator agreement measurements. The annotated genres are subsequently analyzed together with discourse relations (already annotated in the treebank): we present distributions of the annotated genres and results of studying distinctions of distributions of discourse relations across the individual genres.

Valency and Word Order in Czech — A Corpus Probe
Kateřina Rysová | Jiří Mírovský
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a part of broader research on word order aiming at finding factors influencing word order in Czech (i.e. in an inflectional language) and their intensity. The main aim of the paper is to test a hypothesis that obligatory adverbials (in terms of the valency) follow the non-obligatory (i.e. optional) ones in the surface word order. The hypothesis was tested by creating a list of features for the decision trees algorithm and by searching in data of the Prague Dependency Treebank using the search tool PML Tree Query. Apart from the valency, our experiment also evaluates the importance of several other features, such as argument length and deep syntactic function. Neither of the methods used has confirmed the hypothesis, but according to the results, there are several other features that influence word order of contextually non-bound free modifiers of a verb in Czech, namely position of the sentence in the text, form and length of the verb modifiers (the whole subtrees), and the semantic dependency relation (functor) of the modifiers.

Use of Coreference in Automatic Searching for Multiword Discourse Markers in the Prague Dependency Treebank
Magdaléna Rysová | Jiří Mírovský
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2013

(Pre-)Annotation of Topic-Focus Articulation in Prague Czech-English Dependency Treebank
Jiří Mírovský | Kateřina Rysová | Magdaléna Rysová | Eva Hajičová
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Introducing the Prague Discourse Treebank 1.0
Lucie Poláková | Jiří Mírovský | Anna Nedoluzhko | Pavlína Jínová | Šárka Zikánová | Eva Hajičová
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Subordinators with Elaborative Meanings in Czech and English
Pavlína Jínová | Lucie Poláková | Jiří Mírovský
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

Annotators’ Certainty and Disagreements in Coreference and Bridging Annotation in Prague Dependency Treebank
Anna Nedoluzhko | Jiří Mírovský
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

How Dependency Trees and Tectogrammatics Help Annotating Coreference and Bridging Relations in Prague Dependency Treebank
Anna Nedoluzhko | Jiří Mírovský
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

A Case Study of a Free Word Order
Vladislav Kuboň | Markéta Lopatková | Jiří Mírovský
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)

2012

Does Tectogrammatics Help the Annotation of Discourse?
Jiří Mírovský | Pavlína Jínová | Lucie Poláková
Proceedings of COLING 2012: Posters

Interplay of Coreference and Discourse Relations: Discourse Connectives with a Referential Component
Lucie Poláková | Pavlína Jínová | Jiří Mírovský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This contribution explores the subgroup of text structuring expressions with the form preposition + demonstrative pronoun; it is thus devoted to an aspect of the interaction of coreference relations and relations signaled by discourse connectives (DCs) in a text. The demonstrative pronoun typically signals a referential link to an antecedent, whereas the whole expression can, but does not have to, carry a discourse meaning in the sense of discourse connectives. We describe the properties of these phrases/expressions with regard to their antecedents, their position among the text-structuring language means and the features typical for their “connective function” compared to their “non-connective function”. The analysis is carried out on Czech data from the approx. 50,000 sentences of the Prague Dependency Treebank 2.0, directly on the syntactic trees. We explore the characteristics of these phrases/expressions discovered during two projects: the manual annotation of (1) coreference relations (Nedoluzhko et al. 2011) and (2) discourse connectives, their scopes and meanings (Mladová et al. 2008).

Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects
Eva Hajičová | Lucie Poláková | Jiří Mírovský
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects

Semi-Automatic Annotation of Intra-Sentential Discourse Relations in PDT
Pavlína Jínová | Jiří Mírovský | Lucie Poláková
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects

2010

Connective-Based Measuring of the Inter-Annotator Agreement in the Annotation of Discourse in PDT
Jiří Mírovský | Lucie Mladová | Šárka Zikánová
Coling 2010: Posters

Annotation Tool for Discourse in PDT
Jiří Mírovský | Lucie Mladová | Zdeněk Žabokrtský
Coling 2010: Demonstrations

Annotation Tool for Extended Textual Coreference and Bridging Anaphora
Jiří Mírovský | Petr Pajas | Anna Nedoluzhko
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present an annotation tool for the extended textual coreference and the bridging anaphora in the Prague Dependency Treebank 2.0 (PDT 2.0). After a very brief description of the annotation scheme, we focus on details of the annotation process from the technical point of view. We present several features implemented in the annotation tool to help the annotators, such as the possibility to combine surface and deep syntactic representations of sentences during the annotation, automatic maintenance of the coreferential chain, underlining of candidates for antecedents, etc. For studying differences among parallel annotations, the tool offers a simultaneous display of several annotations of the same data. The annotation tool can be used for other corpora too, as long as they have been transformed to the PML format. We present modifications of the tool for working with the coreference relations on other layers of language description, namely on the analytical layer and the morphological layer of PDT.

Typical Cases of Annotators’ Disagreement in Discourse Annotations in Prague Dependency Treebank
Šárka Zikánová | Lucie Mladová | Jiří Mírovský | Pavlína Jínová
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present the first results of the parallel Czech discourse annotation in the Prague Dependency Treebank 2.0. Having established an annotation scenario for capturing semantic relations crossing the sentence boundary in a discourse, and having annotated the first sections of the treebank according to these guidelines, we now report on the results of the first evaluation of these manual annotations. We give an overview of the process of the annotation itself, which we believe is to a large degree language-independent and therefore accessible to any discourse researcher. Next, we describe the inter-annotator agreement measurement, and, most importantly, we classify and analyze the most common types of annotators’ disagreement and propose solutions for the next phase of the annotation. The annotation is carried out on dependency trees (on the tectogrammatical layer); this approach is quite novel and it brings us some advantages when interpreting the syntactic structure of the discourse units.

2009

Play the Language: Play Coreference
Barbora Hladká | Jiří Mírovský | Pavel Schlesinger
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Designing a Language Game for Collecting Coreference Annotation
Barbora Hladká | Jiří Mírovský | Pavel Schlesinger
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

The Coding Scheme for Annotating Extended Nominal Coreference and Bridging Anaphora in the Prague Dependency Treebank
Anna Nedoluzhko | Jiří Mírovský | Petr Pajas
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Netgraph – Making Searching in Treebanks Easy
Jiří Mírovský
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Does Netgraph Fit Prague Dependency Treebank?
Jiří Mírovský
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Using many examples, we present the query language of Netgraph, a fully graphical tool for searching in the Prague Dependency Treebank 2.0. To demonstrate that the query language fits the treebank well, we study the annotation manual for the most complex layer of the treebank, the tectogrammatical layer, and show that linguistic phenomena annotated on the layer can be searched for using the query language.

PDT 2.0 Requirements on a Query Language
Jiří Mírovský
Proceedings of ACL-08: HLT