Natalia Loukachevitch

Also published as: N. Loukachevitch, Natalia V. Loukachevitch

2024

Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian
Natalia Loukachevitch | Andrey Sakhovskiy | Elena Tutubalina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present a new manually annotated dataset of PubMed abstracts for concept normalization in Russian. It contains 23,641 entity mentions in 756 documents linked to 4,544 unique concepts from the UMLS ontology. Compared to existing corpora, we explore two novel annotation characteristics: the nestedness of named entities and the incompleteness of the Russian medical terminology in UMLS. Of these mentions, 4,424 are linked to 1,535 unique English concepts absent from the Russian part of the UMLS ontology. We present several baselines for normalization over nested named entities obtained with state-of-the-art models such as SapBERT. Our experimental results show that models pre-trained on graph structural data from UMLS achieve superior performance in a zero-shot setting on bilingual terminology.

2022

In this paper, we describe entity linking annotation over nested named entities in the recently released Russian NEREL dataset for information extraction. The NEREL collection is currently the largest Russian dataset annotated with entities and relations. It includes 933 news texts with annotation of 29 entity types and 49 relation types. The paper describes the main design principles behind NEREL’s entity linking annotation, provides its statistics, and reports evaluation results for several entity linking baselines. To date, 38,152 entity mentions in 933 documents are linked to Wikidata. The NEREL dataset is publicly available.

Sense-Annotated Corpus for Russian
Alexander Kirillovich | Natalia Loukachevitch | Maksim Kulaev | Angelina Bolshina | Dmitry Ilvovsky
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

We present a sense-annotated corpus for Russian. The resource was obtained by manually annotating texts from OpenCorpora, an open corpus for the Russian language, with senses from the Russian wordnet RuWordNet. The annotation was used as a test collection for comparing unsupervised (Personalized PageRank) and pseudo-labeling methods for Russian word sense disambiguation.

2021

Evaluation of Taxonomy Enrichment on Diachronic WordNet Versions
Irina Nikishina | Natalia Loukachevitch | Varvara Logacheva | Alexander Panchenko
Proceedings of the 11th Global Wordnet Conference

The vast majority of existing approaches to taxonomy enrichment apply word embeddings, as they have proven to accumulate contexts (in a broad sense) extracted from texts that are sufficient for attaching orphan words to the taxonomy. On the other hand, apart from being large lexical and semantic resources, taxonomies are graph structures. Combining word embeddings with the graph structure of a taxonomy could be of use for predicting taxonomic relations. In this paper we compare several approaches for attaching new words to an existing taxonomy that are based on graph representations with one that relies on fastText embeddings. We test all methods on Russian and English datasets, but they could also be applied to other wordnets and languages.

Comparing Similarity of Words Based on Psychosemantic Experiment and RuWordNet
Valery Solovyev | Natalia Loukachevitch
Proceedings of the 11th Global Wordnet Conference

In the paper we compare the structure of the Russian language thesaurus RuWordNet with the data of a psychosemantic experiment to identify semantically close words. The aim of the study is to find out to what extent the structure of RuWordNet corresponds to the intuitive ideas of native speakers about the semantic proximity of words. The respondents were asked to list synonyms for a given word. As a result of the experiment, we found that the respondents mentioned not only synonyms but also words that are in paradigmatic relations with the stimuli. Words of the mental sphere were chosen for the experiment. In 95% of cases, the words characterized in the experiment as semantically close were also close according to the thesaurus. In other cases, additions to the thesaurus were proposed.

In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse level. NEREL can facilitate development of novel models that can extract relations between nested named entities, as well as relations on both sentence and document levels. NEREL also contains the annotation of events involving named entities and their roles in the events. The NEREL collection is available via https://github.com/nerel-ds/NEREL.

2020

Comparison of Genres in Word Sense Disambiguation using Automatically Generated Text Collections
Angelina Bolshina | Natalia Loukachevitch
Proceedings of the 4th International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

The best approaches in Word Sense Disambiguation (WSD) are supervised and rely on large amounts of hand-labelled data, which is not always available and costly to create. In our work we describe an approach that is used to create an automatically labelled collection based on the monosemous relatives (related unambiguous entries) for Russian. The main contribution of our work is that we extracted monosemous relatives that can be located at relatively long distances from a target ambiguous word and ranked them according to the similarity measure to the target sense. We evaluated word sense disambiguation models based on a nearest neighbour classification on BERT and ELMo embeddings and two text collections. Our work relies on the Russian wordnet RuWordNet.

Studying Taxonomy Enrichment on Diachronic WordNet Versions
Irina Nikishina | Varvara Logacheva | Alexander Panchenko | Natalia Loukachevitch
Proceedings of the 28th International Conference on Computational Linguistics

Ontologies, taxonomies, and thesauri have always been in high demand in a large number of NLP tasks. However, most studies are focused on the creation of lexical resources rather than the maintenance of the existing ones and keeping them up-to-date. In this paper, we address the problem of taxonomy enrichment. Namely, we explore the possibilities of taxonomy extension in a resource-poor setting and present several methods which are applicable to a large number of languages. We also create novel English and Russian datasets for training and evaluating taxonomy enrichment systems and describe a technique of creating such datasets for other languages.

2019

Thesaurus Verification Based on Distributional Similarities
Natalia Loukachevitch | Ekaterina Parkhomenko
Proceedings of the 10th Global Wordnet Conference

In this paper we consider an approach to the verification of large lexical-semantic resources such as WordNet. The verification procedure is based on the analysis of discrepancies between corpus-based and thesaurus-based word similarities. We calculated such word similarities on the basis of a Russian news collection and the Russian wordnet (RuWordNet). We applied the procedure to more than 30 thousand words and found some serious errors in word sense description, including incorrect or absent relations and missed main senses of ambiguous words.

Linking Russian Wordnet RuWordNet to WordNet
Natalia Loukachevitch | Anastasia Gerasimova
Proceedings of the 10th Global Wordnet Conference

In this paper we consider the procedure for linking the Russian wordnet (RuWordNet) to WordNet. The specificity of the procedure in our case stems from the fact that a lot of bilingual (Russian and English) lexical data have been gathered in another Russian thesaurus, RuThes, which has a different structure from WordNet. Previously, RuThes was semi-automatically transformed into RuWordNet, which has a WordNet-like structure. Now, the RuThes English data are utilized to establish a matching from the RuWordNet synsets to the WordNet synsets.

Corpus-based Check-up for Thesaurus
Natalia Loukachevitch
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper we discuss the usefulness of applying a checking procedure to existing thesauri. The procedure is based on the analysis of discrepancies of corpus-based and thesaurus-based word similarities. We applied the procedure to more than 30 thousand words of the Russian wordnet and found some serious errors in word sense description, including inaccurate relationships and missing senses of ambiguous words.

Distant Supervision for Sentiment Attitude Extraction
Nicolay Rusnachenko | Natalia Loukachevitch | Elena Tutubalina
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

News articles often convey attitudes between the mentioned subjects, which is essential for understanding the described situation. In this paper, we describe a new approach to distant supervision for extracting sentiment attitudes between named entities mentioned in texts. Two factors (pair-based and frame-based) were used to automatically label an extensive news collection, dubbed RuAttitudes. The latter became a basis for adapting and training convolutional architectures, including piecewise max pooling and full use of information across different sentences. The results show that models trained with RuAttitudes outperform ones that were trained with only a supervised learning approach and achieve a 13.4% increase in F1-score on the RuSentRel collection.

Named Entity Recognition in Information Security Domain for Russian
Anastasiia Sirotina | Natalia Loukachevitch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper we discuss the named entity recognition task for Russian texts related to cybersecurity. First of all, we describe the problems that arise in the course of labeling unstructured texts from the information security domain. We introduce guidelines for human annotators, according to which a corpus has been marked up. Then, a CRF-based system and different neural architectures have been implemented and applied to the corpus. The named entity recognition systems have been evaluated and compared to determine the most efficient one.

2018

Comparing Two Thesaurus Representations for Russian
Natalia Loukachevitch | German Lashevich | Boris Dobrov
Proceedings of the 9th Global Wordnet Conference

In the paper we presented a new Russian wordnet, RuWordNet, which was semi-automatically obtained by transformation of the existing Russian thesaurus RuThes. At the first step, the basic structure of wordnets was reproduced: synsets’ hierarchy for each part of speech and the basic set of relations between synsets (hyponym-hypernym, part-whole, antonyms). At the second stage, we added causation, entailment and domain relations between synsets. Also derivation relations were established for single words and the component structure for phrases included in RuWordNet. The described procedure of transformation highlights the specific features of each type of thesaurus representations.

2017

Human Associations Help to Detect Conventionalized Multiword Expressions
Natalia Loukachevitch | Anastasia Gerasimova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper we show that if we want to obtain human evidence about conventionalization of some phrases, we should ask native speakers about associations they have to a given phrase and its component words. We have shown that if component words of a phrase have each other as frequent associations, then this phrase can be considered as conventionalized. Another type of conventionalized phrases can be revealed using two factors: low entropy of phrase associations and low intersection of component word and phrase associations. The association experiments were performed for the Russian language.

2016

Accounting ngrams and multi-word terms can improve topic models
Michael Nokel | Natalia Loukachevitch
Proceedings of the 12th Workshop on Multiword Expressions

Creating a General Russian Sentiment Lexicon
Natalia Loukachevitch | Anatolii Levchik
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper describes a new Russian sentiment lexicon, RuSentiLex. The lexicon was gathered from several sources: opinionated words from domain-oriented Russian sentiment vocabularies, slang and curse words extracted from Twitter, and objective words with positive or negative connotations from a news collection. The words in the lexicon having different sentiment orientations in specific senses are linked to appropriate concepts of the Russian-language thesaurus RuThes. All lexicon entries are classified according to four sentiment categories and three sources of sentiment (opinion, emotion, or fact). The lexicon can serve as the first version for the construction of domain-specific sentiment lexicons or can be used for feature generation in machine-learning approaches. In this role, the RuSentiLex lexicon was utilized by the participants of the SentiRuEval-2016 Twitter reputation monitoring shared task and allowed them to achieve high results.

2015

A Method of Accounting Bigrams in Topic Models
Michael Nokel | Natalia Loukachevitch
Proceedings of the 11th Workshop on Multiword Expressions

Topic Models: Accounting Component Structure of Bigrams
Michael Nokel | Natalia Loukachevitch
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Determining the most frequent senses using Russian linguistic ontology RuThes
Natalia Loukachevitch | Ilia Chetviorkin
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

Types of Aspect Terms in Aspect-Oriented Sentiment Labeling
Natalia Loukachevitch | Evgeniy Kotelnikov | Pavel Blinov
The 5th Workshop on Balto-Slavic Natural Language Processing

2014

Summarizing News Clusters on the Basis of Thematic Chains
Natalia Loukachevitch | Aleksey Alekseev
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we consider a method for extracting thematic chains: sets of semantically similar language expressions representing different participants of the text story. The method is based on the structural organization of news clusters and exploits comparison of various contexts of words. The word contexts are used as a basis for extracting multiword expressions and constructing thematic chains. The main difference between thematic chains and lexical chains is the basic principle of their construction: thematic chains are intended to model different participants (concrete or abstract) of the situation described in the analyzed texts, which means that elements of the same thematic chain seldom co-occur in the same sentences of the texts under consideration. We evaluate our method on the multi-document summarization task.

RuThes Linguistic Ontology vs. Russian Wordnets
Natalia Loukachevitch | Boris Dobrov
Proceedings of the Seventh Global Wordnet Conference

Two-Step Model for Sentiment Lexicon Extraction from Twitter Streams
Ilia Chetviorkin | Natalia Loukachevitch
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2013

Evaluating Sentiment Analysis Systems in Russian
Ilia Chetviorkin | Natalia Loukachevitch
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

2012

Extraction of Russian Sentiment Lexicon for Product Meta-Domain
Ilia Chetviorkin | Natalia Loukachevitch
Proceedings of COLING 2012

DomEx: Extraction of Sentiment Lexicons for Domains and Meta-Domains
Ilia Chetviorkin | Natalia Loukachevitch
Proceedings of COLING 2012: Demonstration Papers

Automatic Term Recognition Needs Multiple Evidence
Natalia Loukachevitch
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we argue that automatic term extraction is an inherently multifactor process and that term extraction models need to be based on multiple features, including the specific type of terminological resource under development. We proposed three types of features for the extraction of two-word terms and showed that all of them are useful for term extraction. The set of features includes new ones, such as features extracted from an existing domain-specific thesaurus and features based on Internet search results. We studied the set of features for term extraction in two different domains and showed that the combination of several types of features considerably enhances the quality of the term extraction procedure. We found that for developing term extraction models in a specific domain, it is important to take into account some properties of the domain.

2011

Multiple Evidence for Term Extraction in Broad Domains
Boris Dobrov | Natalia Loukachevitch
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Extraction of Domain-specific Opinion Words for Similar Domains
Ilia Chetviorkin | Natalia Loukachevitch
Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition

2006

Development of Linguistic Ontology on Natural Sciences and Technology
B. Dobrov | N. Loukachevitch
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper describes the main principles of development and current state of Linguistic Ontology on Natural Sciences and Technology intended for information-retrieval tasks. In the development of the ontology we combined three different methodologies: development of information-retrieval thesauri, development of wordnets, formal ontology research. Combination of these methodologies allows us to develop large ontologies for broad domains.

2004

Development of Bilingual Domain-Specific Ontology for Automatic Conceptual Indexing
Natalia V. Loukachevitch | Boris V. Dobrov
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

In the paper we describe development, means of evaluation and applications of Russian-English Sociopolitical Thesaurus specially developed as a linguistic resource for automatic text processing applications. The Sociopolitical domain is not a domain of social research but a broad domain of social relations including economic, political, military, cultural, sports and other subdomains. The knowledge of this domain is necessary for automatic text processing of such important documents as official documents, legislative acts, newspaper articles.

Development of Ontologies with Minimal Set of Conceptual Relations
Natalia V. Loukachevitch | Boris V. Dobrov
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

In the paper we describe our approach to the development of ontologies with a small number of relation types. Non-taxonomic relations in our ontologies are based on the conception of ontological dependence described in formal ontology. This minimal set of relations does not depend on a domain or a task and makes it possible to begin the ontology construction at once, as soon as a task is set and a domain is determined, and to obtain the first version of an ontology in a short time. Such an initial ontology can be used for information-retrieval applications and can serve as a structural basis for further development of the ontology.

Russian Information Retrieval Evaluation Seminar
Boris Dobrov | Igor Kuralenok | Natalia Loukachevitch | Igor Nekrestyanov | Ilya Segalovich
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)