Rafał Poświata


2020

pdf bib
Annobot: Platform for Annotating and Creating Datasets through Conversation with a Chatbot
Rafał Poświata | Michał Perełkiewicz
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

In this paper, we introduce Annobot: a platform for annotating and creating datasets through conversation with a chatbot. This natural form of interaction has allowed us to create a more accessible and flexible interface, especially for mobile devices. Our solution has a wide range of applications such as data labelling for binary, multi-class/label classification tasks, preparing data for regression problems, or creating sets for issues such as machine translation, question answering or text summarization. Additional features include pre-annotation, active sampling, online learning and real-time inter-annotator agreement. The system is integrated with the popular messaging platform: Facebook Messanger. Usability experiment showed the advantages of the proposed platform compared to other labelling tools. The source code of Annobot is available under the GNU LGPL license at https://github.com/rafalposwiata/annobot.

pdf bib
Evaluation of Sentence Representations in Polish
Slawomir Dadas | Michał Perełkiewicz | Rafał Poświata
Proceedings of the 12th Language Resources and Evaluation Conference

Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish which led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings and multilingual sentence encoders, showing strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.

2019

pdf bib
Numbers Normalisation in the Inflected Languages: a Case Study of Polish
Rafał Poświata | Michał Perełkiewicz
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Text normalisation in Text-to-Speech systems is a process of converting written expressions to their spoken forms. This task is complicated because in many cases the normalised form depends on the context. Furthermore, when we analysed languages like Croatian, Lithuanian, Polish, Russian or Slovak there is additional difficulty related to their inflected nature. In this paper we want to show how to deal with this problem for one of these languages: Polish, without having a large dedicated data set and using solutions prepared for other NLP tasks. We limited our study to only numbers expressions, which are the most common non-standard words to normalise. The proposed solution is a combination of morphological tagger and transducer supported by a dictionary of numbers in their spoken forms. The data set used for evaluation is based on the part of 1-million word subset of the National Corpus of Polish. The accuracy of the described approach is presented with a comparison to a simple baseline and two commercial systems: Google Cloud Text-to-Speech and Amazon Polly.

pdf bib
ConSSED at SemEval-2019 Task 3: Configurable Semantic and Sentiment Emotion Detector
Rafał Poświata
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system participating in the SemEval-2019 Task 3: EmoContext: Contextual Emotion Detection in Text. The goal was to for a given textual dialogue, i.e. a user utterance along with two turns of context, identify the emotion of user utterance as one of the emotion classes: Happy, Sad, Angry or Others. Our system: ConSSED is a configurable combination of semantic and sentiment neural models. The official task submission achieved a micro-average F1 score of 75.31 which placed us 16th out of 165 participating systems.