Ondřej Pražák


2021

Multilingual Coreference Resolution with Harmonized Annotations
Ondřej Pražák | Miloslav Konopík | Jakub Sido
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we present coreference resolution experiments with CorefUD, a newly created multilingual corpus (Nedoluzhko et al., 2021). We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joint models: one for the Slavic languages and one for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can profit from harmonized annotations, and that joint models help significantly for the languages with less training data.

Czert – Czech BERT-like Model for Language Representation
Jakub Sido | Ondřej Pražák | Pavel Přibáň | Jan Pašek | Michal Seják | Miloslav Konopík
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper describes the training process of the first Czech monolingual language representation models based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss properties of monolingual and multilingual models based on our results. We publish all the pre-trained and fine-tuned models freely for the research community.

2020

UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection
Ondřej Pražák | Pavel Přibáň | Stephen Taylor | Jakub Sido
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe our method for detecting lexical semantic change, i.e., changes in word sense over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1 (binary change detection) and 4th in Sub-task 2 (ranked change detection). Our method is completely unsupervised and language-independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between the earlier and later spaces, using Canonical Correlation Analysis and orthogonal transformation; and measuring the cosine between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.
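
The three steps in the abstract above can be illustrated with a minimal sketch. It covers only the orthogonal-transformation variant (solved here as an orthogonal Procrustes problem via SVD), not Canonical Correlation Analysis, and it is not the authors' implementation. The function names, the toy random embeddings, and the assumption that row i of both matrices holds the vector of the same word are all illustrative.

```python
import numpy as np

def orthogonal_map(X_old, X_new):
    """Orthogonal W minimizing ||X_old @ W - X_new||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X_old.T @ X_new)
    return U @ Vt

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def change_scores(X_old, X_new, target_rows):
    """Map earlier-space vectors into the later space and compare per word.

    A low cosine suggests that the word's usage differs between the corpora.
    """
    W = orthogonal_map(X_old, X_new)
    return {i: cosine(X_old[i] @ W, X_new[i]) for i in target_rows}

# Toy data (assumption): 200 shared words, 8 dimensions. The "later" space is
# a rotated copy of the "earlier" one, except that word 0 is artificially
# flipped to simulate a word whose meaning changed.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(200, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
X_new = X_old @ Q
X_new[0] = -X_new[0]

scores = change_scores(X_old, X_new, target_rows=[0, 1])
print(scores)
```

In this toy setup the unchanged word (row 1) keeps a cosine close to 1 after alignment, while the flipped word (row 0) scores far lower. Turning such scores into a binary decision or a ranking requires a further step that the abstract does not specify.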

2019

ULSAna: Universal Language Semantic Analyzer
Ondřej Pražák | Miloslav Konopík
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We present a live cross-lingual system capable of producing shallow semantic annotations of natural language sentences, currently for 51 languages. The domain of the input sentences is in principle unconstrained. The system uses a single set of training data (in English) for all the languages. The resulting semantic annotations are therefore consistent across different languages. We use CoNLL Semantic Role Labeling training data and Universal Dependencies as the basis for the system. The system is publicly available and supports processing data in batches; therefore, the community can easily use it in subsequent research tasks.

2017

Czech Dataset for Semantic Similarity and Relatedness
Miloslav Konopík | Ondřej Pražák | David Steinberger
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This paper introduces a Czech dataset for semantic similarity and semantic relatedness. The dataset contains word pairs with hand-annotated scores that indicate the semantic similarity and semantic relatedness of the words. It contains 953 word pairs compiled from 9 different sources, together with the words' contexts taken from real text corpora, including extra examples when the words are ambiguous. The dataset is annotated by 5 independent annotators. The average Spearman correlation coefficient of the annotation agreement is r = 0.81. We provide reference evaluation experiments with several methods for computing semantic similarity and relatedness.
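
As an illustration of the agreement figure mentioned above, the following sketch computes a Spearman correlation for every pair of annotators and averages the results. The abstract does not say how the average was taken, so averaging over annotator pairs is an assumption, and the scores below are invented toy values, not data from the dataset.

```python
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def mean_pairwise_spearman(annotations):
    """Average Spearman correlation over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Toy scores (assumption): three annotators rating the same five word pairs.
annotations = [
    [4, 1, 3, 0, 2],
    [4, 0, 3, 1, 2],
    [3, 1, 4, 0, 2],
]
agreement = mean_pairwise_spearman(annotations)
print(round(agreement, 2))
```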

Cross-Lingual SRL Based upon Universal Dependencies
Ondřej Pražák | Miloslav Konopík
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper, we introduce a cross-lingual Semantic Role Labeling (SRL) system with language-independent features based upon Universal Dependencies. We propose two methods to convert SRL annotations from monolingual dependency trees into universal dependency trees. Our SRL system is based upon cross-lingual features derived from universal dependency trees and on supervised learning with a maximum entropy classifier. We design experiments to verify whether Universal Dependencies are suitable for cross-lingual SRL. The results are very promising and open interesting new research paths for the future.

2016

UWB at SemEval-2016 Task 2: Interpretable Semantic Textual Similarity with Distributional Semantics for Chunks
Miloslav Konopík | Ondřej Pražák | David Steinberger | Tomáš Brychcín
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)