Barbora Štěpánková
2026
SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech
Barbora Štěpánková | Michal Novák | Tomáš Musil | Lucie Polakova
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present a project focused on linguistic description, annotation and automatic classification of the so-called epistemic markers in Czech. These expressions, such as pravděpodobně ‘probably’, zřejmě ‘apparently’ and určitě ‘certainly’, typically operate within the pragmatic domain of language. We introduce a dataset containing manual annotations of the 40 most frequent epistemic markers in Czech, totalling almost 4,000 uses. This annotation was created using parallel InterCorp data (in Czech and English) and the TEITOK tool. We describe the annotation scheme used, the annotation process and data handling. The dataset forms the core of the emerging lexical database of these expressions (SEEMLex). Thanks to the comprehensive manual annotation, the dataset can also serve as a source of further pragmatic information and can be used as a basis for further linguistic research. The proposed annotation scheme can also be used for other languages. To demonstrate the dataset’s utility for automatic classification, we trained XLM-RoBERTa classifiers using 10-fold cross-validation, achieving 72.6% accuracy for type of use classification (6 classes) and 54.2% accuracy for degree of certainty classification (4 classes).
Prague Dependency Treebank - Consolidated 2.0: Enriching a Complex Annotation Scheme
Marie Mikulová | Jiří Mírovský | Milan Straka | Pavlína Synková | Jan Štěpánek | Barbora Štěpánková | Jan Hajič
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The Prague Dependency Treebank framework is unique in its attempt to systematically include and link different layers of language, including a meaning representation with several types of inter-sentential phenomena, especially coreference and discourse relations. We present its second consolidated version (PDT-C 2.0), which concludes an almost 30-year-long project of sustained development of the resource into a uniformly and coherently annotated, genre-diversified language resource of Czech with almost 4 million tokens and accompanying fully compatible lexicons. In addition to continuous linguistic research, the richly linguistically annotated corpus is also widely used in international comparisons of the development of traditional and novel NLP tools as well as in conversions into other formalisms. The corpus and the trained parsers are available under the CC BY-NC-SA licence.
Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies
Marie Mikulová | Barbora Štěpánková | Daniel Zeman | Jan Štěpánek | Milan Straka | Jan Hajič
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Czech has been part of Universal Dependencies since its first release in 2015. It has also been one of the best represented languages, with the Prague Dependency Treebank being an order of magnitude larger than most other UD treebanks. More recently, three other datasets from the Prague family were added and the annotations thoroughly revisited, forming the "Prague Dependency Treebank-Consolidated" (PDT-C). In comparison to the original PDT, PDT-C is more than twice as large, but it is also much more diverse in terms of genres and domains. In this paper, we describe the conversion of the new resource to Universal Dependencies. While the two annotation schemes are relatively similar at first sight, there are numerous small differences in the topology of the dependency structures and in the granularity of the POS and relation type inventories. We demonstrate a selection of such differences on examples, and discuss the diverging motivations as well as ways to overcome the differences during conversion. We argue that while PDT is less "universal" and more tightly bound to one language, its multi-layer annotation is rich and provides all information needed for basic UD trees, and much more.
MorfFlex: Handling Rich Morphology
Jaroslava Hlaváčová | Marie Mikulová | Barbora Štěpánková | Milan Straka | Jan Hajič
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present MorfFlex, a morphological dictionary architecture suitable for languages with extensive regularity in both inflection and derivation. As the primary example of MorfFlex in use, we introduce MorfFlex CZ, a morphological dictionary of Czech. It is distributed as a simple, unstructured list of <wordform, lemma, tag> triplets; however, its manually maintained, unpublished source files and conversion scripts encode a sophisticated system of inflectional and derivational patterns. These patterns dramatically reduce the otherwise enormous size of the dictionary, which currently contains over 100 million wordforms and more than 1 million lemmas. The MorfFlex CZ dictionary serves as an essential resource for ensuring the consistency of manual morphological annotation in the Prague Dependency Treebanks and underpins state-of-the-art automatic tools such as MorphoDiTa. In this paper, we focus on: (i) presenting an effective method for managing the rich morphological system within the dictionary, and (ii) demonstrating the utility of such a language resource for maintaining annotation consistency in corpora and supporting the development of advanced NLP applications.
2025
Song Lyrics Adaptations: Computational Interpretation of the Pentathlon Principle
Barbora Štěpánková | Rudolf Rosa
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Songs are an integral part of human culture, and they often resonate the most when we can sing them in our native language. However, translating song lyrics presents a unique challenge: maintaining singability, naturalness, and semantic fidelity. In this work, we computationally interpret Low’s Pentathlon Principle of singable translations in order to properly measure the quality of adapted lyrics, breaking it down into five measurable metrics that reflect the key aspects of singable translations. Building on this foundation, we introduce a text-to-text song lyrics translation system based on generative large language models, designed to meet the Pentathlon Principle’s criteria without relying on melodies or bilingual training data. We experiment on the English-Czech language pair: we collect a dataset of English-to-Czech bilingual song lyrics and identify the desirable values of the five Pentathlon Principle metrics based on the values achieved by human translators. Through detailed human assessment of automatically generated lyric translations, we confirm the appropriateness of the proposed metrics as well as the general validity of the Pentathlon Principle, with some insights into the variation in people’s individual preferences. All code and data are available at https://github.com/stepankovab/Computational-Interpretation-of-the-Pentathlon-Principle.
From Form to Meaning: The Case of Particles within the Prague Dependency Treebank Annotation Scheme
Marie Mikulova | Barbora Štěpánková | Jan Štěpánek
Proceedings of the 31st International Conference on Computational Linguistics
In recent decades, computational linguistics has become increasingly interested in annotation schemes that aim at an adequate description of the meaning of sentences and texts. Discussions are ongoing on an appropriate annotation scheme for a large and complex amount of diverse information. This contribution is devoted to the description of polyfunctional uninflected words (namely particles), i.e. words which, although having only one paradigmatic form, can have several different syntactic functions and even express relatively different semantic distinctions. We argue that it is the multi-layer system (linked from meaning to text) that allows a comprehensive description of the relations between morphological properties, syntactic function and expressed meaning, and thus contributes to greater accuracy in the description of the phenomena concerned and to the overall consistency of the annotated data. These aspects are demonstrated within the Prague Dependency Treebank annotation scheme, whose pioneering proposal can be found in the first COLING proceedings from 1965 (Sgall 1965); to this day, the concept has proved to be sound and serves very well for complex annotation.
2022
Advantages of a Complex Multilayer Annotation Scheme: The Case of the Prague Dependency Treebank
Eva Hajičová | Marie Mikulová | Barbora Štěpánková | Jiří Mírovský
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
Recently, many corpora have been developed that contain multiple annotations of various linguistic phenomena, from morphological categories of words through the syntactic structure of sentences to discourse and coreference relations in texts. Discussions are ongoing on an appropriate annotation scheme for a large amount of diverse information. In our contribution we express our conviction that a multilayer annotation scheme makes it possible to view the language system in its complexity and in the interaction of individual phenomena, and that there are at least two aspects that support such a scheme: (i) A multilayer annotation scheme makes it possible to use the annotation of one layer to design the annotation of other layers, both conceptually and in the form of a pre-annotation procedure or annotation checking rules. (ii) A multilayer annotation scheme presents a reliable ground for corpus studies based on features across the layers. These aspects are demonstrated on the case of the Prague Dependency Treebank. Its multilayer annotation scheme has withstood the test of time and also serves well for complex textual annotations, in which earlier morpho-syntactic annotations are advantageously used. In addition to a reference to the previous projects that utilise its annotation scheme, we present several current investigations.
Quality and Efficiency of Manual Annotation: Pre-annotation Bias
Marie Mikulová | Milan Straka | Jan Štěpánek | Barbora Štěpánková | Jan Hajic
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents an analysis of annotation using automatic pre-annotation for a task of mid-level annotation complexity: dependency syntax annotation. It compares the annotation efforts made by annotators using a pre-annotated version (with a high-accuracy parser) and those made by fully manual annotation. The aim of the experiment is to judge the final annotation quality when pre-annotation is used. In addition, it evaluates the effect of automatic linguistically-based (rule-formulated) checks and another annotation on the same data available to the annotators, and their influence on annotation quality and efficiency. The experiment confirmed that pre-annotation is an efficient tool for faster manual syntactic annotation which increases the consistency of the resulting annotation without reducing its quality.
2020
Prague Dependency Treebank - Consolidated 1.0
Jan Hajič | Eduard Bejček | Jaroslava Hlavacova | Marie Mikulová | Milan Straka | Jan Štěpánek | Barbora Štěpánková
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which is, as has always been the case for the family of the Prague Dependency Treebanks, to serve both as training data for various types of NLP tasks and as a basis for linguistically-oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (albeit not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, Czech translation of the Wall Street Journal, transcribed dialogs and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation. The diversity of the texts and annotations should serve NLP applications well, and it is also an invaluable resource for linguistic research, including comparative studies regarding texts of different genres. The corpus is publicly and freely available.