Announcing the Prague Discourse Treebank 3.0
Pavlína Synková
Jiří Mírovský
Lucie Poláková
Magdaléna Rysová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present the Prague Discourse Treebank 3.0 – a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared to the previous version (PDiT 2.0), version 3.0 comes with three major types of updates: (i) it brings a largely revised annotation of discourse relations: pragmatic relations have been thoroughly reworked, many inconsistencies across all discourse types have been fixed, and previously unclear cases marked in annotators’ comments have been resolved; (ii) it achieves consistency with the Lexicon of Czech Discourse Connectives (CzeDLex); and (iii) it provides the data not only in its native format (Prague Markup Language, with discourse relations annotated on top of the dependency trees), but also in the Penn Discourse Treebank 3.0 format (plain text plus a stand-off discourse annotation) and sense taxonomy. PDiT 3.0 contains 21,662 discourse relations (plus 445 list relations) in 49 thousand sentences.
GeCzLex: Lexicon of Czech and German Anaphoric Connectives
Lucie Poláková
Kateřina Rysová
Magdaléna Rysová
Jiří Mírovský
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the first version of GeCzLex, an online electronic resource for translation equivalents of Czech and German discourse connectives. The lexicon is one of the outcomes of research on anaphoricity and long-distance relations in discourse. At present, it contains anaphoric connectives (ACs) for Czech and German, together with their possible translations (not necessarily anaphoric) documented in bilingual parallel corpora. As a basis, we use two existing monolingual lexicons of connectives, the Lexicon of Czech Discourse Connectives (CzeDLex) and the Lexicon of Discourse Markers (DiMLex) for German. We interlink their relevant entries via semantic annotation of the connectives (according to the PDTB 3 sense taxonomy) and statistical information on translation possibilities from the Czech and German parallel data of the InterCorp project. The lexicon is, as far as we know, the first bilingual inventory of connectives with linkage on the level of individual entries, and a first attempt to systematically describe devices engaged in long-distance, non-local discourse coherence. The lexicon is freely available under the Creative Commons License.
A Test Suite and Manual Evaluation of Document-Level NMT at WMT19
Kateřina Rysová
Magdaléna Rysová
Tomáš Musil
Lucie Poláková
Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
As the quality of machine translation rises and neural machine translation (NMT) moves from sentence-level to document-level translation, it is becoming increasingly difficult to evaluate the output of translation systems. We provide a test suite for WMT19 aimed at assessing how MT systems participating in the News Translation Task handle discourse phenomena. We have manually checked the outputs and identified types of translation errors that are relevant to document-level translation.
EvalD Reference-Less Discourse Evaluation for WMT18
Ondřej Bojar
Jiří Mírovský
Kateřina Rysová
Magdaléna Rysová
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
We present the results of automatic evaluation of discourse in machine translation (MT) outputs using the EVALD tool. EVALD was originally designed and trained to assess the quality of human writing by native speakers and foreign-language learners. MT has seen a tremendous leap in translation quality at the level of sentences, and it is thus interesting to see whether the human-level evaluation is becoming relevant.
Introducing EVALD – Software Applications for Automatic Evaluation of Discourse in Czech
Kateřina Rysová
Magdaléna Rysová
Jiří Mírovský
Michal Novák
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
In the paper, we introduce two software applications for automatic evaluation of coherence in Czech texts called EVALD (Evaluator of Discourse). The first one, EVALD 1.0, evaluates texts written by native speakers of Czech on a five-step scale commonly used at Czech schools (grade 1 is the best, grade 5 is the worst). The second application, EVALD 1.0 for Foreigners, assesses texts by non-native speakers of Czech using a six-step scale (A1–C2) according to CEFR. Both applications are available online at
Extracting a Lexicon of Discourse Connectives in Czech from an Annotated Corpus
Pavlína Synková
Magdaléna Rysová
Lucie Poláková
Jiří Mírovský
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation
Designing CzeDLex – A Lexicon of Czech Discourse Connectives
Jiří Mírovský
Pavlína Jínová
Magdaléna Rysová
Lucie Poláková
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters
Automatic evaluation of surface coherence in L2 texts in Czech
Kateřina Rysová
Magdaléna Rysová
Jiří Mírovský
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016)
Secondary Connectives in the Prague Dependency Treebank
Magdaléna Rysová
Kateřina Rysová
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)
Use of Coreference in Automatic Searching for Multiword Discourse Markers in the Prague Dependency Treebank
Magdaléna Rysová
Jiří Mírovský
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop
Verbs of Saying with a Textual Connecting Function in the Prague Discourse Treebank
Magdaléna Rysová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The paper tries to contribute to the general discussion on discourse connectives, specifically to the question of whether it is meaningful to distinguish two separate groups of connectives: “classical” connectives limited to a few predefined classes like conjunctions or adverbs (e.g. “but”) vs. alternative lexicalizations of connectives (i.e. unrestricted expressions and phrases like “the reason is”, “he added”, “the condition was” etc.). In this respect, the paper focuses on one group of these broader connectives in Czech, the selected verbs of saying “doplnit/doplňovat” (“to complement”), “upřesnit/upřesňovat” (“to specify”), “dodat/dodávat” (“to add”), “pokračovat” (“to continue”), and analyses their occurrence and function in texts from the Prague Discourse Treebank. The paper demonstrates that these verbs of saying have a special place among the other connectives, as they contain two items: e.g. “he added” means “and he said”, so the verb “to add” contains information about the relation to the previous context (“and”) plus the verb of saying (“to say”). This finding led us to a more general observation: discourse connectives in a broader sense do not necessarily connect two pieces of a text, but some of them carry the second argument right in their semantics, which “classical” connectives can never do.
The Centre and Periphery of Discourse Connectives
Magdaléna Rysová
Kateřina Rysová
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing
(Pre-)Annotation of Topic-Focus Articulation in Prague Czech-English Dependency Treebank
Jiří Mírovský
Kateřina Rysová
Magdaléna Rysová
Eva Hajičová
Proceedings of the Sixth International Joint Conference on Natural Language Processing
Alternative Lexicalizations of Discourse Connectives in Czech
Magdaléna Rysová
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The paper concentrates on which linguistic means may be included in the annotation of discourse relations in the Prague Dependency Treebank (PDT) and tries to examine the so-called alternative lexicalizations of discourse markers (AltLex's) in Czech. The analysis proceeds from the annotated data of PDT and tries to draw a comparison between the Czech AltLex's from PDT and English AltLex's from PDTB (the Penn Discourse Treebank). The paper presents a lexico-syntactic and semantic characterization of the Czech AltLex's and comments on the current stage of their annotation in PDT. In the current version, PDT contains 306 expressions (within a total of 43,955 sentences) that were labeled by annotators as being an AltLex. However, as the analysis demonstrates, this number is not final. We suppose that it will increase after further elaboration, as AltLex's are not restricted to a limited set of syntactic classes and some of them exhibit a great degree of variation.