Katarzyna Krasnowska-Kieraś
2026
The Corpus of Contemporary Polish — a New Reference Corpus with Rich Syntactic Annotations
Witold Kieraś | Małgorzata Marciniak | Marcin Woliński | Katarzyna Krasnowska-Kieraś | Marek Łaziński
Proceedings of the Fifteenth Language Resources and Evaluation Conference
In the paper, we describe the Corpus of Contemporary Polish (KWJP) and its rich syntactic annotation. The corpus covers a wide range of texts originally published between 2011 and 2020. Although it carries on the idea of providing up-to-date reference corpora of Polish initiated by the National Corpus of Polish (NKJP) project, the principles underlying its development are not the same. In this article, we outline the different choices that affect the corpus content and explain the reasons for them. The article focuses mainly on the annotation layers in KWJP, which are generated with a neural-network-based tool developed specially for this purpose. We describe in detail the syntactic structure annotation, which is represented by hybrid trees combining information typical of constituency and dependency trees. Finally, we provide several examples showing how annotation with hybrid trees facilitates querying and effective searching for information in the corpus.
2024
Parsing Headed Constituencies
Katarzyna Krasnowska-Kieraś | Marcin Woliński
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In the paper, we present a parsing technique that generates headed constituency trees, which combine information typically contained in constituency and dependency trees. We advocate for using such structures for syntactic representation. The parsing method combines prediction of dependency links with prediction of constituency spines in a ‘parsing as tagging’ approach and outputs a hybrid structure. An interesting feature is that the method can generate constituency trees with discontinuities. The parser is built on top of a BERT model for the given language and uses a specially crafted classifier for predicting dependency links. With suitable training data, the method can be applied to an arbitrary language; we report evaluation results for Polish and German.
2019
Empirical Linguistic Study of Sentence Embeddings
Katarzyna Krasnowska-Kieraś | Alina Wróblewska
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
The purpose of the research is to determine whether linguistic information is retained in vector representations of sentences. We introduce a method of analysing the content of sentence embeddings based on universal probing tasks, along with the classification datasets for two contrasting languages. We perform a series of probing and downstream experiments with different types of sentence embeddings, followed by a thorough analysis of the experimental results. Aside from dependency parser-based embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings.
2017
Polish evaluation dataset for compositional distributional semantics models
Alina Wróblewska | Katarzyna Krasnowska-Kieraś
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The paper presents a procedure for building an evaluation dataset for the validation of compositional distributional semantics models estimated for languages other than English. The procedure generally builds on the steps designed to assemble the SICK corpus, which contains pairs of English sentences annotated for semantic relatedness and entailment, because we aim at building a comparable dataset. However, the implementation of particular building steps differs significantly from the original SICK design assumptions, owing both to the lack of necessary external resources for the investigated language and to the need for language-specific transformation rules. The designed procedure is verified on Polish, a fusional language with a relatively free word order, and contributes to building a Polish evaluation dataset. The resource consists of 10K sentence pairs which are human-annotated for semantic relatedness and entailment. The dataset may be used for the evaluation of compositional distributional semantics models of Polish.