Marek Łaziński
2026
The Corpus of Contemporary Polish — a New Reference Corpus with Rich Syntactic Annotations
Witold Kieraś | Małgorzata Marciniak | Marcin Woliński | Katarzyna Krasnowska-Kieraś | Marek Łaziński
Proceedings of the Fifteenth Language Resources and Evaluation Conference
In this paper, we describe the Corpus of Contemporary Polish (KWJP) and its rich syntactic annotation. The corpus covers a wide range of texts originally published between 2011 and 2020. Although it carries on the idea of providing up-to-date reference corpora of Polish initiated by the National Corpus of Polish (NKJP) project, the principles underlying its development are not the same. We outline the different choices that affect corpus content and explain the reasons for them. The article focuses mainly on the annotation layers in KWJP, which are generated with a neural-network-based tool developed specifically for this purpose. We describe in detail the syntactic structure annotation, which is represented by hybrid trees combining information typical of constituency and dependency trees. Finally, we provide several examples showing how annotation with hybrid trees facilitates querying and effective searching for information in the corpus.
2010
Recent Developments in the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Marek Łaziński | Piotr Pęzik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The aim of the paper is to present recent (as of March 2010) developments in the construction of the National Corpus of Polish (NKJP). The NKJP project was launched at the very end of 2007 and aims to compile a large, linguistically annotated corpus of contemporary Polish by the end of 2010. Out of the total pool of 1 billion words of text data collected in the project, a 300-million-word balanced corpus will be selected to match a set of predefined representativeness criteria. The present paper outlines a number of recent developments in the NKJP project, including: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools. As the work on NKJP progresses, it becomes clear that this project serves as an important testbed for linguistic annotation and interoperability standards. We believe that our recent experiences will prove relevant to future large-scale language resource compilation efforts.
2008
Towards the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Barbara Lewandowska-Tomaszyk | Marek Łaziński
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper presents a new corpus project aiming at building a national corpus of Polish. What makes it different from a typical YACP (Yet Another Corpus Project) is that 1) all four partners in the project have constructed corpora of Polish in the past, sometimes in the spirit of collaboration, at other times in the spirit of competition, 2) the partners bring varying areas of expertise and experience into the project, so a synergy effect is anticipated, and 3) the corpus will be built with an eye on specific applications in various fields, including lexicography (the corpus will be the empirical basis of a new large general dictionary of Polish) and natural language processing (a number of NLP tools will be constructed within the project).