2024
Announcing the Prague Discourse Treebank 3.0
Pavlína Synková | Jiří Mírovský | Lucie Poláková | Magdaléna Rysová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present the Prague Discourse Treebank 3.0, a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared to the previous version (PDiT 2.0), version 3.0 comes with three major types of updates: (i) it brings a largely revised annotation of discourse relations: pragmatic relations have been thoroughly reworked, many inconsistencies across all discourse types have been fixed, and previously unclear cases marked in annotators’ comments have been resolved; (ii) it achieves consistency with the Lexicon of Czech Discourse Connectives (CzeDLex); and (iii) it provides the data not only in its native format (Prague Markup Language, with discourse relations annotated on top of the dependency trees), but also in the Penn Discourse Treebank 3.0 format (plain text plus a stand-off discourse annotation) and sense taxonomy. PDiT 3.0 contains 21,662 discourse relations (plus 445 list relations) in 49 thousand sentences.
Charles Translator: A Machine Translation System between Ukrainian and Czech
Martin Popel | Lucie Poláková | Michal Novák | Jindřich Helcl | Jindřich Libovický | Pavel Straňák | Tomas Krabac | Jaroslava Hlavacova | Mariia Anisimova | Tereza Chlanova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in the required quality. The translator was later implemented as an online web interface and as an Android app with speech input, both featuring Cyrillic-Latin script transliteration. The system translates directly, unlike other available systems that use English as a pivot, and thus takes advantage of the typological similarity of the two languages. It uses the block back-translation method, which allows for efficient use of monolingual training data. The paper describes the development process, including data collection, implementation, and evaluation, mentions several use cases, and outlines possibilities for further development of the system for educational purposes.
Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank
Jiří Mírovský | Pavlína Synková | Lucie Poláková | Marie Paclíková
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present a cost-effective method for obtaining a high-quality annotation of explicit discourse relations in the Czech part of the Prague Czech–English Dependency Treebank, a corpus of almost 50 thousand sentences coming from the Czech translation of the Wall Street Journal part of the Penn Treebank. We use three different sources of information and combine them to obtain the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) manual tectogrammatical (deep syntax) representation of sentences of the corpus, and (iii) the Lexicon of Czech Discourse Connectives CzeDLex. After solving as many discrepancies as possible automatically, the final discourse annotation is achieved by manual inspection of the remaining problematic cases. The discourse annotation of the corpus will be available both in the Prague format (on top of tectogrammatical trees) with the Prague taxonomy of discourse types, and in the Penn format (on plain texts) with the Penn Discourse Treebank 3.0 sense taxonomy.
Developing a Rhetorical Structure Theory Treebank for Czech
Lucie Poláková | Jiří Mírovský | Šárka Zikánová | Eva Hajičová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce the first version of the Czech RST Discourse Treebank, a collection of Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST), a global coherence model proposed by Mann and Thompson (1988). Each document in the corpus is represented as a single tree-like structure, where discourse units are interconnected through hierarchical rhetorical relations and their relative importance for the main purpose of a text is modeled by the nuclearity principle. The treebank is freely available in the LINDAT/CLARIAH-CZ repository under the Creative Commons license; for some documents, it includes two gold annotations representing divergent yet relevant interpretations. The paper outlines the annotation process, provides corpus statistics and evaluation, and discusses the issue of consistency associated with the global level of textual interpretation. In general, good agreement on the structure and labeling could be achieved on the lowest, local tree level and on the identification of the most central (nuclear) elementary discourse units. Disagreements mostly concerned segmentation and, in the structure, differences in the stepwise process of linking the largest text blocks. The project contributes to the advancement of RST research and its application to real-world text analysis challenges.
2020
GeCzLex: Lexicon of Czech and German Anaphoric Connectives
Lucie Poláková | Kateřina Rysová | Magdaléna Rysová | Jiří Mírovský
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the first version of GeCzLex, an online electronic resource for translation equivalents of Czech and German discourse connectives. The lexicon is one of the outcomes of research on anaphoricity and long-distance relations in discourse. At present, it contains anaphoric connectives (ACs) for Czech and German, together with their possible translations documented in bilingual parallel corpora (the translations are not necessarily anaphoric). As a basis, we use two existing monolingual lexicons of connectives, the Lexicon of Czech Discourse Connectives (CzeDLex) and the Lexicon of Discourse Markers (DiMLex) for German, and interlink their relevant entries via semantic annotation of the connectives (according to the PDTB 3 sense taxonomy) and statistical information on translation possibilities from the Czech and German parallel data of the InterCorp project. The lexicon is, as far as we know, the first bilingual inventory of connectives with linkage on the level of individual entries, and a first attempt to systematically describe devices engaged in long-distance, non-local discourse coherence. The lexicon is freely available under the Creative Commons License.
CzeDLex 0.6 and its Representation in the PML-TQ
Jiří Mírovský | Lucie Poláková | Pavlína Synková
Proceedings of the Twelfth Language Resources and Evaluation Conference
CzeDLex is an electronic lexicon of Czech discourse connectives with its data coming from a large treebank annotated with discourse relations. Its new version CzeDLex 0.6 (as compared with the previous version 0.5, which was published in 2017) is significantly larger with respect to manually processed entries. Also, its structure has been modified to allow for primary connectives to appear with multiple entries for a single discourse sense. The lexicon comes in several formats, being both human and machine readable, and is available for searching in PML Tree Query, a user-friendly and powerful search tool for all kinds of linguistically annotated treebanks. The main purpose of this paper/demo is to present the new version of the lexicon and to demonstrate possibilities of mining various types of information from the lexicon using PML Tree Query; we present several examples of search queries over the lexicon data along with their results. The new version of the lexicon, CzeDLex 0.6, is available on-line and was officially released in December 2019 under the Creative Commons License.
2019
A Test Suite and Manual Evaluation of Document-Level NMT at WMT19
Kateřina Rysová | Magdaléna Rysová | Tomáš Musil | Lucie Poláková | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
As the quality of machine translation rises and neural machine translation (NMT) moves from sentence-level to document-level translation, it is becoming increasingly difficult to evaluate the output of translation systems. We provide a test suite for WMT19 aimed at assessing discourse phenomena in the output of MT systems participating in the News Translation Task. We have manually checked the outputs and identified types of translation errors that are relevant to document-level translation.
2017
Extracting a Lexicon of Discourse Connectives in Czech from an Annotated Corpus
Pavlína Synková | Magdaléna Rysová | Lucie Poláková | Jiří Mírovský
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation
2016
Designing CzeDLex – A Lexicon of Czech Discourse Connectives
Jiří Mírovský | Pavlína Jínová | Magdaléna Rysová | Lucie Poláková
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters
Searching in the Penn Discourse Treebank Using the PML-Tree Query
Jiří Mírovský | Lucie Poláková | Jan Štěpánek
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The PML-Tree Query is a general, powerful and user-friendly system for querying richly linguistically annotated treebanks. The paper shows how the PML-Tree Query can be used for searching for discourse relations in the Penn Discourse Treebank 2.0 mapped onto the syntactic annotation of the Penn Treebank.
2014
Genres in the Prague Discourse Treebank
Lucie Poláková | Pavlína Jínová | Jiří Mírovský
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a project classifying the documents of the Prague Discourse Treebank (Czech journalistic texts) by genre. Our main interest lies in opening the possibility to observe how text coherence is realized in different types (in the genre sense) of language data and, in the future, in exploring ways of using genres as a feature for multi-sentence-level language technologies. In the paper, we first describe the motivation and the concept of the genre annotation, and briefly introduce the Prague Discourse Treebank. Then, we elaborate on the process of manual annotation of genres in the treebank, from the annotators’ manual work to post-annotation checks and inter-annotator agreement measurements. The annotated genres are subsequently analyzed together with discourse relations (already annotated in the treebank): we present distributions of the annotated genres and the results of studying how the distributions of discourse relations differ across the individual genres.
Discourse Relations in the Prague Dependency Treebank 3.0
Jiří Mírovský | Pavlína Jínová | Lucie Poláková
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations
2013
Machine Translation with Many Manually Labeled Discourse Connectives
Thomas Meyer | Lucie Poláková
Proceedings of the Workshop on Discourse in Machine Translation
Subordinators with Elaborative Meanings in Czech and English
Pavlína Jínová | Lucie Poláková | Jiří Mírovský
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)
Introducing the Prague Discourse Treebank 1.0
Lucie Poláková | Jiří Mírovský | Anna Nedoluzhko | Pavlína Jínová | Šárka Zikánová | Eva Hajičová
Proceedings of the Sixth International Joint Conference on Natural Language Processing
2012
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects
Eva Hajičová | Lucie Poláková | Jiří Mírovský
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects
Semi-Automatic Annotation of Intra-Sentential Discourse Relations in PDT
Pavlína Jínová | Jiří Mírovský | Lucie Poláková
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects
Does Tectogrammatics Help the Annotation of Discourse?
Jiří Mírovský | Pavlína Jínová | Lucie Poláková
Proceedings of COLING 2012: Posters
Interplay of Coreference and Discourse Relations: Discourse Connectives with a Referential Component
Lucie Poláková | Pavlína Jínová | Jiří Mírovský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This contribution explores the subgroup of text-structuring expressions with the form preposition + demonstrative pronoun; it is thus devoted to one aspect of the interaction between coreference relations and relations signaled by discourse connectives (DCs) in a text. The demonstrative pronoun typically signals a referential link to an antecedent, whereas the whole expression can, but does not have to, carry a discourse meaning in the sense of discourse connectives. We describe the properties of these expressions with regard to their antecedents, their position among text-structuring language means, and the features typical of their connective function as compared to their non-connective function. The analysis is carried out on Czech data from the approx. 50,000 sentences of the Prague Dependency Treebank 2.0, directly on the syntactic trees. We explore the characteristics of these expressions discovered during two projects: the manual annotation of (1) coreference relations (Nedoluzhko et al. 2011) and (2) discourse connectives, their scopes and meanings (Mladová et al. 2008).