Lubomir Otrusina


2017

Semantic Enrichment Across Language: A Case Study of Czech Bibliographic Databases
Pavel Smrz | Lubomir Otrusina
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

WTF-LOD - A New Resource for Large-Scale NER Evaluation
Lubomir Otrusina | Pavel Smrz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper introduces the Web TextFull linkage to Linked Open Data (WTF-LOD) dataset, intended for large-scale evaluation of named entity recognition (NER) systems. First, we present the process of collecting data from the largest publicly available textual corpora, including Wikipedia dumps, monthly runs of the CommonCrawl, and ClueWeb09/12. We discuss similarities to and differences from related initiatives such as WikiLinks and WikiReverse. Our work focuses primarily on links from “textfull” documents (links surrounded by text that provides useful context for entity linking), de-duplication of the data, and advanced cleaning procedures. The presented statistics demonstrate that the collected data forms one of the largest available resources of its kind. They also demonstrate the suitability of the resulting data for complex NER evaluation campaigns, including an analysis of the most ambiguous name mentions appearing in the data.
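The “textfull” criterion described in the abstract — keeping only links surrounded by text that provides useful context — can be illustrated with a minimal sketch. The class below is a hypothetical simplification, not the paper's actual pipeline: it extracts each anchor link together with the plain text before and after it, and discards links with no surrounding text.

```python
from html.parser import HTMLParser

class LinkContextParser(HTMLParser):
    """Extract links with their textual context.

    Hypothetical simplification of the "textfull" criterion: a link is
    kept only if the text around it is non-empty.
    """

    def __init__(self):
        super().__init__()
        self.links = []       # (href, anchor_text, left_ctx, right_ctx)
        self._buf = []        # text seen since the last link
        self._pending = None  # (href, anchor, left_ctx) awaiting right_ctx
        self._href = None
        self._anchor = None

    def _flush(self):
        # Emit the pending link, using the buffered text as right context.
        if self._pending is not None:
            href, anchor, left = self._pending
            right = "".join(self._buf).strip()
            if left or right:  # "textfull": some context must survive
                self.links.append((href, anchor, left, right))
            self._pending = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._flush()
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if self._anchor is not None:
            self._anchor.append(data)
        else:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            left = "".join(self._buf).strip()
            self._buf = []
            self._pending = (self._href, "".join(self._anchor), left)
            self._href, self._anchor = None, None

    def close(self):
        super().close()
        self._flush()

parser = LinkContextParser()
parser.feed('<p>The capital of <a href="/wiki/France">France</a> is Paris.</p>')
parser.close()
```

Here `parser.links` holds `("/wiki/France", "France", "The capital of", "is Paris.")` — the anchor plus the context a linking system would disambiguate against.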

2014

Deep Learning from Web-Scale Corpora for Better Dictionary Interfaces
Pavel Smrz | Lubomir Otrusina
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

2013

BUT-TYPED: Using domain knowledge for computing typed similarity
Lubomir Otrusina | Pavel Smrz
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2010

A New Approach to Pseudoword Generation
Lubomir Otrusina | Pavel Smrz
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Sense-tagged corpora are used to evaluate word sense disambiguation (WSD) systems. Manual creation of such resources is often prohibitively expensive. That is why the concept of pseudowords - conflations of two or more unambiguous words - has been integrated into WSD evaluation experiments. This paper presents a new method of pseudoword generation which takes into account the semantic relatedness of the candidate words forming parts of the pseudowords to the particular senses of the word to be disambiguated. We compare the new approach to its alternatives and show that results on pseudowords that are more similar to real ambiguous words correspond better to results on real data. Two techniques for assessing the similarity are studied - the first takes advantage of manually created dictionaries (wordnets), the second builds on statistical data computed automatically from large corpora. Pros and cons of the two techniques are discussed and results on a standard task are reported.
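The basic pseudoword construction the abstract builds on can be sketched in a few lines. The function below is illustrative only: it conflates two unambiguous words into one pseudoword and records the original word as the gold “sense” label; the paper's contribution lies in *choosing* the constituent words by their semantic relatedness to the senses of a real ambiguous word, which this sketch does not attempt.

```python
import re

def make_pseudoword_corpus(sentences, words):
    # Conflate the given unambiguous words into a single pseudoword.
    # Each occurrence is replaced in the text, and the original word is
    # kept as the gold "sense" label for WSD evaluation.
    pseudo = "_".join(words)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    tagged = []
    for sent in sentences:
        labels = pattern.findall(sent)   # gold senses, in order
        tagged.append((pattern.sub(pseudo, sent), labels))
    return tagged

corpus = ["she ate a banana", "he opened the door"]
for text, gold in make_pseudoword_corpus(corpus, ["banana", "door"]):
    print(text, gold)
```

A WSD system is then asked to recover the gold labels (here `banana` vs. `door`) for each occurrence of the artificial ambiguous word `banana_door`.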