Pavlína Synková
2026
Presenting the Prague Discourse Treebank 4.0
Jiří Mírovský | Pavlína Synková
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The Prague Discourse Treebank 4.0 is a large genre-diversified language resource with annotation of discourse relations marked by explicit connectives in Czech texts. It consists of 175 thousand sentences with 82 thousand discourse relations. We present the treebank as well as the methods used during the annotation of its individual parts, some of which were annotated fully manually, others using cost-effective partially automatic methods, achieving a comparable quality. The discourse annotation is available in two formats and theoretical frameworks: the Prague discourse annotation on top of deep syntax dependency trees, and the Penn Discourse Treebank style on top of plain texts, using both discourse type/sense taxonomies in both formats. The corpus is publicly and freely available, offering a valuable resource for linguistic research and natural language processing tasks.
DReUD: Discourse Relations in Universal Dependencies
Jiří Mírovský | Pavlína Synková
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present a proposal for an annotation scheme and data representation for shallow discourse relation annotation in the Universal Dependencies (UD) framework, as a theoretically appropriate and practically oriented extension of the established morphosyntactic analysis. We outline the design requirements for the annotation scheme, encompassing simplicity, comprehensibility, theoretical grounding, practical applicability and technical robustness, while accommodating the specific constraints of shallow discourse analysis. We also present a work-in-progress baseline version of DReUD (Discourse Relations in Universal Dependencies), a modular shallow discourse parser for Universal Dependencies, available for Czech and English as a command-line program, a web client and a REST API service, and designed for seamless and rapid integration of discourse relation analysis into both theoretical research and NLP applications.
Prague Dependency Treebank - Consolidated 2.0: Enriching a Complex Annotation Scheme
Marie Mikulová | Jiří Mírovský | Milan Straka | Pavlína Synková | Jan Štěpánek | Barbora Štěpánková | Jan Hajič
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The Prague Dependency Treebank framework is unique in its attempt to systematically include and link different layers of language, including a meaning representation with several types of inter-sentential phenomena, especially coreference and discourse relations. We present its second consolidated version (PDT-C 2.0), which concludes an almost 30-year-long project of sustained development of the resource into a uniformly and coherently annotated, genre-diversified language resource of the Czech language with almost 4 million tokens and accompanying fully compatible lexicons. In addition to continuous linguistic research, the richly linguistically annotated corpus is also widely used in international comparisons of the development of traditional and novel NLP tools as well as in conversions into other formalisms. The corpus and the trained parsers are available under the CC BY-NC-SA licence.
2024
Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank
Jiří Mírovský | Pavlína Synková | Lucie Polakova | Marie Paclíková
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present a cost-effective method for obtaining a high-quality annotation of explicit discourse relations in the Czech part of the Prague Czech–English Dependency Treebank, a corpus of almost 50 thousand sentences coming from the Czech translation of the Wall Street Journal part of the Penn Treebank. We use three different sources of information and combine them to obtain the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) manual tectogrammatical (deep syntax) representation of sentences of the corpus, and (iii) the Lexicon of Czech Discourse Connectives CzeDLex. After solving as many discrepancies as possible automatically, the final discourse annotation is achieved by manual inspection of the remaining problematic cases. The discourse annotation of the corpus will be available both in the Prague format (on top of tectogrammatical trees) with the Prague taxonomy of discourse types, and in the Penn format (on plain texts) with the Penn Discourse Treebank 3.0 sense taxonomy.
Announcing the Prague Discourse Treebank 3.0
Pavlína Synková | Jiří Mírovský | Lucie Poláková | Magdaléna Rysová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present the Prague Discourse Treebank 3.0 – a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared to the previous version (PDiT 2.0), version 3.0 comes with three major types of updates: (i) it brings a largely revised annotation of discourse relations: pragmatic relations have been thoroughly reworked, many inconsistencies across all discourse types have been fixed and previously unclear cases marked in annotators’ comments have been resolved, (ii) it achieves consistency with the Lexicon of Czech Discourse Connectives (CzeDLex), and (iii) it provides the data not only in its native format (Prague Markup Language, discourse relations annotated on top of the dependency trees), but also in the Penn Discourse Treebank 3.0 format (plain text plus a stand-off discourse annotation) and sense taxonomy. PDiT 3.0 contains 21,662 discourse relations (plus 445 list relations) in 49 thousand sentences.
2020
CzeDLex 0.6 and its Representation in the PML-TQ
Jiří Mírovský | Lucie Poláková | Pavlína Synková
Proceedings of the Twelfth Language Resources and Evaluation Conference
CzeDLex is an electronic lexicon of Czech discourse connectives with its data coming from a large treebank annotated with discourse relations. Its new version CzeDLex 0.6 (as compared with the previous version 0.5, which was published in 2017) is significantly larger with respect to manually processed entries. Also, its structure has been modified to allow for primary connectives to appear with multiple entries for a single discourse sense. The lexicon comes in several formats, being both human and machine readable, and is available for searching in PML Tree Query, a user-friendly and powerful search tool for all kinds of linguistically annotated treebanks. The main purpose of this paper/demo is to present the new version of the lexicon and to demonstrate possibilities of mining various types of information from the lexicon using PML Tree Query; we present several examples of search queries over the lexicon data along with their results. The new version of the lexicon, CzeDLex 0.6, is available on-line and was officially released in December 2019 under the Creative Commons License.
2017
Signalling Implicit Relations: A PDTB - RST Comparison
Lucie Poláková | Jiří Mírovský | Pavlína Synková
Dialogue & Discourse Volume 8
Describing implicit phenomena in discourse is known to be a problematic task, from both theoretical and empirical perspectives. The present article contributes to this topic by a novel comparative analysis of two prominent annotation approaches to discourse relations (coherence relations) that were carried out on the same texts. We compare the annotation of implicit relations in the Penn Discourse Treebank 2.0, i.e. discourse relations not signaled by an explicit discourse connective, to the recently released analysis of signals of rhetorical relations in the RST Signalling Corpus (RST-SC). The intersection of corresponding pairs of relations is rather a small one, but it shows a clear tendency: unlike the overall signal distribution in the RST-SC, more than half of the signals in the studied intersection are of semantic type, formed mostly by loosely defined lexical chains. Our data transformation allows for a simultaneous depiction and detailed study of the two resources.