Jan Wieczorek


2023

pdf
ISO 24617-2 on a cusp of languages
Krzysztof Hwaszcz | Marcin Oleksy | Aleksandra Domogała | Jan Wieczorek
Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19)

The article discusses the challenges of cross-linguistic dialogue act annotation, which involves using methods developed for one language to annotate conversations in another language. The article specifically focuses on the research on dialogue act annotation in Polish, based on the ISO standard developed for English. The article examines the differences between Polish and English in dialogue act annotation based on selected examples from DiaBiz.Kom corpus, such as the use of honorifics in Polish, the use of inflection to convey meaning in Polish, the tendency to use complex sentence structures in Polish, and the cultural differences that may play a role in the annotation of dialogue acts. The article also discusses the creation of DiaBiz.Kom, a Polish dialogue corpus based on ISO 24617-2 standard applied to 1100 transcripts.

2022

pdf
Towards a contextualised spatial-diachronic history of literature: mapping emotional representations of the city and the country in Polish fiction from 1864 to 1939
Agnieszka Karlińska | Cezary Rosiński | Jan Wieczorek | Patryk Hubar | Jan Kocoń | Marek Kubis | Stanisław Woźniak | Arkadiusz Margraf | Wiktor Walentynowicz
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this article, we discuss the conditions surrounding the building of historical and literary corpora. We describe the assumptions and method of making the original corpus of the Polish novel (1864-1939). Then, we present the research procedure aimed at demonstrating the variability of the emotional value of the concept of “the city” and “the country” in the texts included in our corpus. The proposed method considers the complex socio-political nature of Central and Eastern Europe, especially the fact that there was no unified Polish state during this period. The method can be easily replicated in studies of the literature of countries with similar specificities.

pdf
DiaBiz.Kom - towards a Polish Dialogue Act Corpus Based on ISO 24617-2 Standard
Marcin Oleksy | Jan Wieczorek | Dorota Drużyłowska | Julia Klyus | Aleksandra Domogała | Krzysztof Hwaszcz | Hanna Kędzierska | Daria Mikoś | Anita Wróż
Proceedings of the 29th International Conference on Computational Linguistics

This article presents the specification and evaluation of DiaBiz.Kom – the corpus of dialogue texts in Polish. The corpus contains transcriptions of telephone conversations conducted according to a prepared scenario. The transcripts of conversations have been manually annotated with a layer of information concerning communicative functions. DiaBiz.Kom is the first corpus of this type prepared for the Polish language and will be used to develop a system of dialog analysis and modules for creating advanced chatbots.

2020

pdf
PST 2.0 – Corpus of Polish Spatial Texts
Michał Marcińczuk | Marcin Oleksy | Jan Wieczorek
Proceedings of the Twelfth Language Resources and Evaluation Conference

In the paper, we focus on modeling spatial expressions in texts. We present the guidelines used to annotate the PST 2.0 (Corpus of Polish Spatial Texts) — a corpus designed for training and testing the tools for spatial expression recognition. The corpus contains a set of texts gathered from texts collected from travel blogs available under Creative Commons license. We have defined our guidelines based on three existing specifications for English (SpatialML, SpatialRole Labelling from SemEval-2013 Task 3 and ISO-Space1.4 from SpaceEval 2014). We briefly present the existing specifications and discuss what modifications have been made to adapt the guidelines to the characteristics of the Polish language. We also describe the process of data collection and manual annotation, including inter-annotator agreement calculation and corpus statistics. In the end, we present detailed statistics of the PST 2.0 corpus, which include the number of components, relations, expressions, and the most common values of spatial indicators, motion indicators, path indicators, distances, directions, and regions.