Pavel Král

Also published as: Pavel Kral


Evaluation Datasets for Cross-lingual Semantic Textual Similarity
Tomáš Hercig | Pavel Kral
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Semantic textual similarity (STS) systems estimate the degree of meaning similarity between two sentences. Cross-lingual STS systems estimate the degree of meaning similarity between two sentences, each in a different language. State-of-the-art algorithms usually employ a strongly supervised, resource-rich approach that is difficult to use for poorly-resourced languages. However, any approach needs evaluation data to confirm its results. To simplify the evaluation process for poorly-resourced languages (in terms of STS evaluation datasets), we present new datasets for cross-lingual and monolingual STS for languages that lack such evaluation data. We also present the results of several state-of-the-art methods on these data, which can be used as a baseline for further research. We believe that this article will not only extend current STS research to other languages, but will also encourage competition on this new evaluation data.

Transfer Learning for Czech Historical Named Entity Recognition
Helena Hubková | Pavel Kral
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Nowadays, named entity recognition (NER) achieves excellent results on standard corpora. However, major issues emerge when it must be applied in a specific domain, because this requires a suitable annotated corpus with an adapted NE tag set. This is particularly evident in the field of historical document processing. The main goal of this paper is to propose and evaluate several transfer learning methods to increase the score of Czech historical NER. We study several information sources, and we use two neural nets for NE modeling and recognition. We employ two corpora for the evaluation of our transfer learning methods, namely the Czech named entity corpus and the Czech historical named entity corpus. We show that a BERT representation with fine-tuning and only a simple classifier trained on the union of the corpora achieves excellent results.


UWB@FinTOC-2020 Shared Task: Financial Document Title Detection
Tomáš Hercig | Pavel Kral
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

This paper describes our system created for the Financial Document Structure Extraction Shared Task (FinTOC-2020): Title Detection. We rely on the Apache PDFBox library to extract text and additional information, e.g., font type and font size, from the financial prospectuses. Our constrained system uses only the provided training data without any additional external resources. Our system is based on the Maximum Entropy classifier and various features including font type and font size. Our system achieves an F1 score of 81% and first place in the French track, and an F1 score of 77% and second place among 5 participating teams in the English track.

Czech Historical Named Entity Corpus v 1.0
Helena Hubková | Pavel Kral | Eva Pettersson
Proceedings of the Twelfth Language Resources and Evaluation Conference

As the number of digitized archival documents increases very rapidly, named entity recognition (NER) in historical documents has become very important for information extraction and data mining. This task requires an annotated corpus, which has so far been missing for Czech. In this paper we present a new annotated data collection for historical NER, composed of Czech historical newspapers. This corpus is freely available for research purposes. For this corpus, we have defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We further conducted experiments on this corpus using recurrent neural networks. We experimented with randomly initialized embeddings and with static and dynamic fastText word embeddings. We achieved an F1 score of 0.73 with a bidirectional LSTM model using static fastText embeddings.


Czech Text Document Corpus v 2.0
Pavel Král | Ladislav Lenc
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Unsupervised Dialogue Act Induction using Gaussian Mixtures
Tomáš Brychcín | Pavel Král
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper introduces a new unsupervised approach for dialogue act induction. Given a sequence of dialogue utterances, the task is to assign them labels representing their function in the dialogue. Utterances are represented as real-valued vectors encoding their meaning. We model the dialogue as a Hidden Markov model with emission probabilities estimated by Gaussian mixtures. We use Gibbs sampling for posterior inference. We present the results on the standard Switchboard-DAMSL corpus. Our algorithm achieves promising results compared with strong supervised baselines and outperforms other unsupervised algorithms.

Word Embeddings for Multi-label Document Classification
Ladislav Lenc | Pavel Král
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper, we analyze and evaluate word embeddings for representing longer texts in the multi-label classification scenario. The embeddings are used in three convolutional neural network topologies. The experiments are carried out on the Czech ČTK and English Reuters-21578 standard corpora. We compare the results of word2vec static and trainable embeddings with randomly initialized word vectors. We conclude that initialization does not play an important role for classification. However, learning of word vectors is crucial to obtain good results.


UWB at SemEval-2016 Task 7: Novel Method for Automatic Sentiment Intensity Determination
Ladislav Lenc | Pavel Král | Václav Rajtmajer
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)