Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be relatively easily applied in other domains, such as insertion, deletion or substitution, mostly change the sentence meaning significantly and yield incorrect examples. Wordnets are potentially a perfect source of rich, high-quality data that, when combined with the capacity of generative models, can help to solve this complex task. In this work, we use plWordNet, the wordnet of the Polish language, to explore the capability of encoder-decoder architectures in data augmentation of sense glosses. We discuss the limitations of generative methods and perform a qualitative review of generated data samples.
Many knowledge-based solutions have been proposed to solve the Word Sense Disambiguation (WSD) problem with limited annotated resources. Such WSD algorithms are able to cover very large sense repositories, but are still outperformed by supervised ones on benchmark data. In this paper, we start with an analysis identifying key properties and issues in the application of spreading activation algorithms to knowledge-based WSD, e.g. the influence of local network structures, interaction with context information and sense frequency. Taking our observations as a point of departure, we introduce a novel solution with new context-to-sense matching using BERT embeddings, an iterative parallel spreading activation function and selective sense alignment using contextual BERT embeddings. The proposed solution achieves performance beyond the state of the art among contemporary knowledge-based WSD approaches on both English and Polish data.
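To make the core mechanism concrete, below is a minimal, self-contained sketch of iterative parallel spreading activation over a toy sense graph. All node names, edge weights, the decay factor and the initial activation values are illustrative assumptions; the actual method additionally uses BERT-based context-to-sense matching and selective sense alignment.

```python
# Toy iterative parallel spreading activation over a small sense graph.
from collections import defaultdict

# undirected sense graph: sense -> list of (neighbour, edge_weight)  (hypothetical)
graph = {
    "bank_1": [("river_1", 0.8), ("water_1", 0.5)],
    "bank_2": [("money_1", 0.9), ("loan_1", 0.7)],
    "river_1": [("bank_1", 0.8), ("water_1", 0.9)],
    "water_1": [("bank_1", 0.5), ("river_1", 0.9)],
    "money_1": [("bank_2", 0.9), ("loan_1", 0.6)],
    "loan_1": [("bank_2", 0.7), ("money_1", 0.6)],
}

# initial activation, e.g. context-to-sense similarity scores (hypothetical values)
activation = defaultdict(float, {"river_1": 1.0, "water_1": 0.6})

DECAY = 0.5      # fraction of activation passed to neighbours in each pulse
N_PULSES = 5     # number of parallel spreading iterations

for _ in range(N_PULSES):
    pulse = defaultdict(float)
    for node, act in activation.items():
        for neighbour, weight in graph.get(node, []):
            pulse[neighbour] += DECAY * weight * act   # propagate in parallel
    for node, extra in pulse.items():
        activation[node] += extra

# the sense of "bank" with the highest activation is selected
best = max(["bank_1", "bank_2"], key=lambda s: activation[s])
print(best, dict(activation))
```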
Focusing on the recognition of multi-word expressions (MWEs), we address the problem of recording MWEs in WordNet. In fact, not all MWEs recorded in that lexical database can be considered lexicalised beyond doubt (e.g. elements of the wordnet taxonomy, quantifier phrases, certain collocations). In this paper, we use a cross-encoder approach to improve our earlier method of distinguishing between lexicalised and non-lexicalised MWEs found in WordNet, which relied on custom-designed rule-based and statistical approaches. We achieve an F1-measure close to 80% for the class of lexicalised word combinations, easily beating two baselines (a random one and a majority-class one). The language model also proves better than a feature-based logistic regression model.
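The following is a minimal sketch of the cross-encoder setup under stated assumptions: the base model (bert-base-multilingual-cased), the example MWE and gloss are placeholders, and the classification head is randomly initialised, so it would have to be fine-tuned on labelled (MWE, description) pairs before the scores become meaningful.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL = "bert-base-multilingual-cased"        # hypothetical base model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

mwe = "biały kruk"                                   # candidate expression (toy example)
gloss = "rzadki, cenny egzemplarz, zwykle książki"   # its wordnet gloss (toy example)

# cross-encoder: both texts are fed jointly through one transformer
enc = tok(mwe, gloss, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)
print({"non-lexicalised": float(probs[0, 0]), "lexicalised": float(probs[0, 1])})
```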
Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and to predict the senses of derivates (derived words). In this work, we analyse different methods of representing derivational forms obtained from WordNet – from quantitative vectors to learned contextual embeddings – and compare ways of classifying the derivational relations that hold between them. Our research focuses on the explainability of the obtained representations and results. The data source for our research is plWordNet, the wordnet of the Polish language, which includes a rich set of derivation examples.
Effective methods for multiword expression detection are important for many technologies related to Natural Language Processing. Most contemporary methods are based on the sequence labelling scheme applied to an annotated corpus, while traditional methods use statistical measures. In our approach, we aim to integrate the ideas behind these two lines of work. We present a novel weakly supervised multiword expression extraction method which focuses on the behaviour of expressions in various contexts. Our method uses a lexicon of English multiword lexical units acquired from The Oxford Dictionary of English as a reference knowledge base and leverages neural language modelling with deep learning architectures. In our approach, we do not need a corpus annotated specifically for the task. The only required components are a lexicon of multiword units, a large corpus, and a general contextual embeddings model. We propose a method for building a silver dataset by spotting multiword expression occurrences and acquiring statistical collocations as negative samples. The sample representation is inspired by representations used in Natural Language Inference and relation recognition. Very good results (F1=0.8) were obtained with a CNN applied to individual occurrences, followed by weighted voting used to combine results from the whole corpus. The proposed method can be quite easily applied to other languages.
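A minimal sketch of the corpus-level voting step is given below. The per-occurrence probabilities are hard-coded stand-ins for CNN outputs, and the confidence-based weighting is an illustrative assumption rather than the exact scheme used in the paper.

```python
# Combine per-occurrence classifier probabilities into one decision per candidate.
def weighted_vote(occurrence_probs):
    """occurrence_probs: P(MWE) for each occurrence of a candidate in the corpus."""
    # weight each vote by how far it lies from the 0.5 decision boundary
    num = sum(p * abs(p - 0.5) for p in occurrence_probs)
    den = sum(abs(p - 0.5) for p in occurrence_probs) or 1.0
    return num / den

probs = [0.92, 0.85, 0.40, 0.78]     # hypothetical CNN outputs for 4 occurrences
score = weighted_vote(probs)
print("MWE" if score >= 0.5 else "collocation", round(score, 3))
```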
The paper reports on the methodology and final results of a large-scale synset mapping between plWordNet and Princeton WordNet. Dedicated manual and semi-automatic mapping procedures as well as interlingual relation types for nouns, verbs, adjectives and adverbs are described. The statistics of all types of interlingual relations are also provided.
Neural language models, including transformer-based models pre-trained on very large corpora, have become a common way to represent text in various tasks, including the recognition of textual semantic relations, e.g. those of Cross-document Structure Theory (CST). Pre-trained models are usually fine-tuned to downstream tasks, and the obtained vectors are used as input for deep neural classifiers; no linguistic knowledge obtained from resources and tools is utilised. In this paper we compare such universal approaches with a combination of a rich, graph-based, linguistically motivated sentence representation and a typical neural network classifier, applied to the task of recognising CST relations in Polish. The representation describes selected levels of the sentence structure, including a description of lexical meanings on the basis of wordnet (plWordNet) synsets and the connected SUMO concepts. The obtained results show that, in the case of difficult relations and a medium-sized training corpus, the semantically enriched text representation leads to significantly better results.
Relation Extraction is a fundamental NLP task. In this paper we investigate the impact of the underlying text representation on the performance of neural classification models in the task of Brand-Product relation extraction. We also present the methodology of preparing annotated textual corpora for this task and provide valuable insight into the properties of Brand-Product relations found in textual corpora. The problem is approached from the practical angle of applying Relation Extraction to facilitate commercial Internet monitoring.
The study explores the application of a simple Convolutional Neural Network to the problem of authorship attribution of tweets written in Polish. In our solution we use a two-step processing of tweets: compression with the Byte Pair Encoding algorithm and vectorisation using a distributional model generated from a large corpus of Polish tweets by the word2vec algorithm, which serves as the input to the network. Our method achieves results comparable to the state-of-the-art approaches for a similar task on English tweets and shows very good performance in the classification of Polish tweets. We tested the proposed method in relation to the number of authors and the number of tweets per author. We also juxtaposed results for authors with different topic backgrounds.
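A minimal sketch of the representation step, under stated assumptions, is shown below: a toy, pre-segmented list stands in for real Byte Pair Encoding output, a small word2vec model is trained over the subword corpus with gensim, and a tweet vector is obtained by averaging its subword vectors before it is fed to a classifier.

```python
import numpy as np
from gensim.models import Word2Vec

# hypothetical BPE-segmented tweets ("@@" marks a non-final subword piece)
segmented_tweets = [
    ["dzi@@", "siaj", "pada", "deszcz"],
    ["jutro", "bę@@", "dzie", "sło@@", "necznie"],
    ["dzi@@", "siaj", "bę@@", "dzie", "mecz"],
]

# train a small word2vec model over the subword corpus
w2v = Word2Vec(segmented_tweets, vector_size=50, window=3, min_count=1, epochs=50)

def tweet_vector(subwords, model):
    """Represent a tweet as the mean of its subword vectors."""
    vecs = [model.wv[s] for s in subwords if s in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(tweet_vector(segmented_tweets[0], w2v).shape)   # (50,) vector, e.g. CNN input
```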
Word Sense Disambiguation remains a challenging NLP task. Due to the lack of annotated training data, especially for rare senses, supervised approaches are usually designed for specific subdomains limited to a narrow subset of identified senses. Recent advances in this area have shown that knowledge-based approaches are more scalable and obtain more promising results in all-words WSD scenarios. In this work we present a faster WSD algorithm based on the Monte Carlo approximation of sense probabilities given a context, using constrained random walks over linked semantic networks. We show that local semantic relatedness is mostly sufficient to successfully identify correct senses when an extensive knowledge base and a proper weighting scheme are used. The proposed methods are evaluated on English (SensEval, SemEval) and Polish (Składnica, KPWr) datasets.
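The sketch below illustrates the general Monte Carlo idea on a toy semantic network: short random walks are started from the senses of the context words, and visit counts over the candidate senses approximate their probabilities. The graph, walk length and number of walks are illustrative assumptions, not the paper's actual settings or constraints.

```python
import random
from collections import Counter

# toy semantic network: sense -> neighbouring senses
graph = {
    "bank_1": ["river_1", "water_1"],
    "bank_2": ["money_1", "loan_1"],
    "river_1": ["bank_1", "water_1"],
    "water_1": ["river_1", "bank_1"],
    "money_1": ["bank_2", "loan_1"],
    "loan_1": ["money_1", "bank_2"],
}

def walk_visits(start_nodes, n_walks=2000, max_len=4, seed=0):
    """Count node visits over short random walks started from the context senses."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_walks):
        node = rng.choice(start_nodes)
        for _ in range(max_len):
            node = rng.choice(graph[node])
            visits[node] += 1
    return visits

# context words "river" and "water" activate their senses
visits = walk_visits(["river_1", "water_1"])
candidates = ["bank_1", "bank_2"]
total = sum(visits[c] for c in candidates) or 1
for c in candidates:
    print(c, round(visits[c] / total, 3))   # approximate P(sense | context)
```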
In this paper we present a morpho-syntactic tagger dedicated to Computer-mediated Communication (CMC) texts in Polish. Its construction is based on an expanded RNN-based neural network adapted to work on noisy texts. Among several techniques, the tagger utilises fastText embedding vectors, sequential character embedding vectors, and Brown clustering for a coarse-grained representation of sentence structures. In addition, a set of manually written rules is used for post-processing. The system was trained to disambiguate word descriptions with respect to part-of-speech tags together with full morphological information, i.e. the values of the different grammatical categories. We also present an evaluation of several model variants on gold-standard annotated CMC data, a comparison to the state-of-the-art taggers for Polish, and an error analysis. The proposed tagger shows significantly better results in this domain and demonstrates the viability of such adaptation.
The paper presents the latest release of the Polish wordnet, namely plWordNet 4.1. The most significant developments since version 3.0 include new relations for nouns and verbs, the mapping of semantic role-relations from the valency lexicon Walenty onto the plWordNet structure, and sense-level inter-lingual mapping. Several statistics are presented in order to illustrate the development and the contemporary state of the wordnet.
In this paper, we compare a variety of sense-tagged sentiment resources, including SentiWordNet, ML-Senticon, plWordNet emo and the NTU Multilingual Corpus. The goal is to investigate the quality of the resources and see how well the sentiment polarity annotation maps across languages.
plWordNet, the wordnet of Polish, has become a very comprehensive description of the Polish lexical system. This paper presents a plan for its semi-automated integration with thesauri, terminological databases and ontologies, as a further necessary step in its development. This will improve the linking of plWordNet to Linked Open Data and facilitate applications in, e.g., WSD, keyword extraction or automated metadata generation. We present an overview of resources relevant to Polish and a plan for their linking to plWordNet.
The paper presents an expansion of the verb model for plWordNet – the wordnet of Polish. A modified system of constitutive features (register, aspect and verb classes), synset relations and lexical relations is presented. Special attention is given to the proposed new relations and to changes in the verb classification. We also discuss the results of its verification by application to the description of a relatively large sample of Polish verbs. The model introduces a new class of relations, namely non-constitutive synset relations, which are shared among lexical units but describe, rather than define, synsets. The proposed model is compared to the entailment relations in other wordnets and to the description of verbs based on valency frames.
The paper presents an approach to building a very large emotive lexicon for Polish based on plWordNet. An expanded annotation model is discussed, in which lexical units (word senses) are annotated with basic emotions, fundamental human values and sentiment polarity. The annotation is performed manually in the 2+1 scheme by pairs of linguists and psychologists. Guidelines referring to usage in corpora, substitution tests as well as linguistic properties of lexical units (e.g. derivational associations) are discussed. The application of the model in a substantial extension of the emotive annotation of plWordNet is presented. The achieved high inter-annotator agreement shows that a promising emotive resource can be created with a relatively small workload.
The paper presents a newly rebuilt and expanded version 2.0 of WordnetLoom – an open wordnet editor. It facilitates work on a multilingual system of wordnets, is based on an efficient thin-client software architecture, and offers more flexibility in enriching the wordnet representation. The new version builds on the experience collected during more than 10 years of using the previous one for plWordNet development. We discuss its extensions motivated by the collected experience. A special focus is given to the development of a variant for the needs of the MultiWordnet of Portuguese, which is based on a very different wordnet development model.
The paper presents a feature-based model of equivalence targeted at (manual) sense linking between Princeton WordNet and plWordNet. The model incorporates insights from lexicographic and translation theories on bilingual equivalence and draws on the results of the earlier synset-level mapping of nouns between Princeton WordNet and plWordNet. It takes into account all basic aspects of language such as form, meaning and function, and supplements them with (parallel) corpus frequency and translatability. Three types of equivalence are distinguished, namely strong, regular and weak, depending on conformity with the proposed features. The presented solutions are language-neutral and can be easily applied to language pairs other than Polish and English. Sense-level mapping is more fine-grained than the existing synset mappings and thus has great potential for human and machine translation.
The paper presents the construction of large-scale test datasets for word embeddings on the basis of a very large wordnet. The datasets were then applied to the evaluation of word embedding models and used to assess and compare the usefulness of different word embeddings extracted from a very large corpus of Polish. We also analysed and compared several publicly available models described in the literature. In addition, several large word embedding models built on the basis of a very large Polish corpus are presented.
Word embeddings have been used for the extraction of the hyponymy relation in several approaches, but it has also recently been argued that, in fact, they should not work. In our work we verified both claims using a very large wordnet of Polish as a gold standard for lexico-semantic relations and word embeddings extracted from a very large corpus of Polish. We showed that a hyponymy extraction method based on linear regression classifiers trained on clusters of vectors can be successfully applied on a large scale. We also presented a possible explanation for the contradictory findings in the literature. Moreover, in order to show the feasibility of the method, we extended it to the recognition of meronymy.
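Below is a minimal sketch, under stated assumptions, of the per-cluster classification idea: word pairs are represented by embedding differences, hyponym vectors are clustered, and a separate classifier is trained per cluster. Random vectors stand in for real word embeddings, the word pairs are toy data, and a logistic-regression classifier stands in for the linear regression classifiers used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 20
emb = {w: rng.normal(size=dim) for w in
       ["pies", "zwierzę", "kot", "stół", "mebel", "rzeka", "woda"]}

# (hyponym, hypernym, label): 1 = hyponymy holds, 0 = it does not (toy data)
pairs = [("pies", "zwierzę", 1), ("kot", "zwierzę", 1), ("stół", "mebel", 1),
         ("zwierzę", "pies", 0), ("rzeka", "stół", 0), ("woda", "mebel", 0)]

# cluster the hyponym vectors and train one linear classifier per cluster
X_hypo = np.stack([emb[h] for h, _, _ in pairs])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_hypo)

models = {}
for c in set(clusters):
    idx = [i for i, cl in enumerate(clusters) if cl == c]
    X = np.stack([emb[pairs[i][0]] - emb[pairs[i][1]] for i in idx])  # difference vectors
    y = np.array([pairs[i][2] for i in idx])
    if len(set(y)) > 1:                  # need both classes to fit a classifier
        models[c] = LogisticRegression().fit(X, y)

for c in sorted(set(clusters)):
    print("cluster", c, "-> classifier" if c in models else "-> skipped (single class)")
```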
In this paper we present a comprehensive overview of recent methods for sentiment propagation in a wordnet. Next, we propose a fully automated method called Classifier-based Polarity Propagation, which utilises a very rich set of features, most of them based on wordnet relation types, multi-level bags of synsets and bags of polarities. We evaluated our solution using the manually annotated part of plWordNet 3.1 emo, which contains more than 83k manual sentiment annotations covering more than 41k synsets. We demonstrated that, starting from the same seed synsets, our method achieves statistically significantly better results than existing rule-based methods that use a specific narrow set of semantic relations.
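The sketch below illustrates, on hypothetical data, the feature idea behind classifier-based propagation: each synset is described by counts of neighbour polarities per relation type (a "bag of polarities"), and a classifier trained on annotated synsets predicts polarity for unannotated ones. Synset names, edges and the choice of a naive Bayes classifier are illustrative assumptions.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# relation-typed neighbours: synset -> list of (relation, neighbour)   (toy data)
edges = {
    "zadowolenie.1": [("hypernymy", "radość.1"), ("antonymy", "smutek.1")],
    "przygnębienie.1": [("hypernymy", "smutek.1"), ("antonymy", "radość.1")],
    "euforia.1": [("hypernymy", "radość.1")],
    "żal.1": [("hypernymy", "smutek.1")],
}
seed_polarity = {"radość.1": "positive", "smutek.1": "negative",
                 "zadowolenie.1": "positive", "przygnębienie.1": "negative"}

def features(synset):
    """Bag of polarities: count neighbour polarities per relation type."""
    feats = Counter()
    for rel, nb in edges.get(synset, []):
        feats[f"{rel}:{seed_polarity.get(nb, 'unknown')}"] += 1
    return feats

train = ["zadowolenie.1", "przygnębienie.1"]
vec = DictVectorizer()
X = vec.fit_transform([features(s) for s in train])
clf = MultinomialNB().fit(X, [seed_polarity[s] for s in train])

for s in ["euforia.1", "żal.1"]:
    print(s, clf.predict(vec.transform([features(s)]))[0])
```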
This paper presents a supervised approach to the recognition of Cross-document Structure Theory (CST) relations in Polish texts. In the proposed approach, a graph-based representation is constructed for sentences. Graphs are built on the basis of lexicalised syntactic-semantic relations extracted from text. Similarity between sentences is calculated from the graphs, and the similarity values are input to classifiers trained with the Logistic Model Tree algorithm. Several different graph configurations, as well as graph similarity methods, were analysed for this task. The approach was evaluated on a large open corpus annotated manually with 17 types of selected CST relations. The configuration of the experiments was similar to those known from SemEval, and we obtained very promising results.
In this article we present the results of recent research on the recognition of genuine Polish suicide notes (SNs). We provide a useful method to distinguish between SNs and other types of discourse, including counterfeited SNs. The method uses a wide range of word-based and semantic features and was evaluated using the Polish Corpus of Suicide Notes, which contains 1244 genuine SNs, expanded with a manually prepared set of 334 counterfeited SNs and 2200 letter-like texts from the Internet. We utilised an algorithm for creating class-related sense dictionaries to improve the results of SN classification. The obtained results show that there are fundamental differences between genuine SNs and counterfeited SNs. The applied method of sense dictionary construction appeared to be the best way of improving the model.
The paper explores the application of plWordNet, a very large wordnet of Polish, in weakly supervised Word Sense Disambiguation (WSD). Because plWordNet provides only partial descriptions through glosses and usage examples, and does not include sense-disambiguated glosses, PageRank-based WSD methods perform slightly worse than for English. However, we show that using weights for relation types and the order in which lexical units were added, for sense re-ranking, can significantly improve WSD precision. The evaluation was done on two Polish corpora (KPWr and Składnica) that include manual WSD annotation. We discuss the fundamental difference in the construction of the two corpora and their very different test results.
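A minimal sketch of the general knowledge-based PageRank idea with relation-type weights is given below, using networkx on a toy graph. The nodes, edge weights, relation-weight table and personalisation vector are illustrative assumptions, not plWordNet data or the paper's exact re-ranking scheme.

```python
import networkx as nx

REL_WEIGHT = {"hypernymy": 1.0, "hyponymy": 0.8, "meronymy": 0.4}   # hypothetical weights

G = nx.Graph()
edges = [("zamek_1", "budowla_1", "hypernymy"),    # zamek_1 = castle
         ("zamek_2", "mechanizm_1", "hypernymy"),  # zamek_2 = lock
         ("budowla_1", "mur_1", "meronymy"),
         ("mechanizm_1", "klucz_1", "meronymy")]
for u, v, rel in edges:
    G.add_edge(u, v, weight=REL_WEIGHT[rel])

# context words activate their (unambiguous) senses
personalization = {n: 0.0 for n in G.nodes}
personalization.update({"mur_1": 1.0, "budowla_1": 1.0})

scores = nx.pagerank(G, alpha=0.85, personalization=personalization, weight="weight")
best = max(["zamek_1", "zamek_2"], key=scores.get)
print(best, {s: round(scores[s], 3) for s in ["zamek_1", "zamek_2"]})
```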
It took us nearly ten years to get from no wordnet for Polish to the largest wordnet ever built. We started small but quickly learned to dream big. Now we are about to release plWordNet 3.0-emo – complete with sentiment and emotions annotated – and a domestic version of Princeton WordNet, larger than WordNet 3.1 by nearly ten thousand newly added words. The paper retraces the road we travelled and talks a little about the future.
We have released plWordNet 3.0, a very large wordnet for Polish. In addition to what is expected in wordnets – richly interrelated synsets – it contains sentiment and emotion annotations, a large set of multi-word expressions, and a mapping onto WordNet 3.1. Part of the release is enWordNet 1.0, a substantially enlarged copy of WordNet 3.1, with material added to allow for a more complete mapping. The paper discusses the design principles of plWordNet, its content, its statistical portrait, a comparison with similar resources, and a partial list of applications.
In this paper we study a rule-based approach to mapping plWordNet onto the SUMO upper ontology on the basis of already existing mappings: plWordNet – Princeton WordNet – SUMO. Information acquired from the inter-lingual relations between plWordNet and Princeton WordNet and from the relations between Princeton WordNet and the SUMO ontology is used in the proposed rules. Several mapping rules, together with matching examples, are presented. The automated mapping results were evaluated in two steps: (i) we automatically checked the formal correctness of the mappings for pairs of a plWordNet synset and a SUMO concept, and (ii) a subset of 160 mapping examples was manually checked by linguists in the 2+1 scheme. We analysed the types of mapping errors and their causes. The proposed rules showed very high precision, especially when errors in the resources are taken into account. Because both wordnets were constructed independently, the obtained rules are not trivial and reveal differences between the two wordnets and the two languages.
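To make the composition idea concrete, here is an illustrative sketch: a plWordNet synset inherits a SUMO concept from the Princeton WordNet synset it is linked to, with the mapping label depending on the interlingual relation type. All identifiers, labels and the single composition rule below are hypothetical simplifications of the paper's rule set.

```python
plwn_to_pwn = {            # plWordNet synset -> (interlingual relation, PWN synset)
    "zamek.2": ("synonymy", "castle.n.02"),
    "chata.1": ("hyponymy", "house.n.01"),
}
pwn_to_sumo = {            # PWN synset -> (relation, SUMO concept)
    "castle.n.02": ("equivalent", "StationaryArtifact"),
    "house.n.01": ("equivalent", "House"),
}

def map_to_sumo(plwn_synset):
    """Compose the two existing mappings into a plWordNet -> SUMO mapping."""
    ilr, pwn = plwn_to_pwn[plwn_synset]
    sumo_rel, concept = pwn_to_sumo[pwn]
    # toy composition rule: synonymy + equivalence -> equivalence, anything weaker -> subsumption
    label = "equivalent" if (ilr, sumo_rel) == ("synonymy", "equivalent") else "subsumed"
    return concept, label

for s in plwn_to_pwn:
    print(s, *map_to_sumo(s))
```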
Building a wordnet is a serious undertaking. Fortunately, Language Technology (LT) can improve the process of wordnet construction both in terms of quality and cost. In this paper we present the LT tools used during the construction of plWordNet and their influence on the lexicographer's workflow. LT is employed in plWordNet development at every possible step: from data gathering through data analysis to data presentation. Every decision still requires input from the lexicographer, but the quality of the supporting tools is an important factor. Thus, a limited evaluation of the usefulness of the employed tools was carried out on the basis of questionnaires.
The paper presents the construction of Derywator – a language tool for the recognition of Polish derivational relations. It was built using machine learning, following a bootstrapping approach: a limited set of derivational pairs described manually by linguists in plWordNet is used to train Derywator. The tool is intended to be applied in the semi-automated expansion of plWordNet with new instances of derivational relations. The training process is based on the construction of two transducers working in opposite directions: one for prefixes and one for suffixes. Internal stem alternations are recognised, recorded in the form of mapping sequences and stored together with the transducers. Raw results produced by Derywator next undergo corpus-based and morphological filtering. A set of derivational relations defined in plWordNet is presented, and the results of tests for different derivational relations are discussed. The problem of the necessary corpus-based semantic filtering is analysed. The presented tool depends only to a very small extent on hand-crafted knowledge for a particular language: only a table of possible alternations and the morphological filtering rules must be exchanged, and this should not take longer than a couple of working days.
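The toy sketch below illustrates the suffix-rewrite idea on hypothetical data: longest-common-prefix alignment of manually described derivational pairs yields suffix rewrite rules, which can then propose derivatives for unseen words. The real tool additionally uses transducers, handles prefixes and internal stem alternations, and applies corpus-based and morphological filtering.

```python
from collections import Counter

# manually described derivational pairs (base verb -> gerund), toy data
pairs = [("czytać", "czytanie"), ("pisać", "pisanie"), ("malować", "malowanie")]

def suffix_rule(base, derived):
    """Align on the longest common prefix and return (suffix to strip, suffix to append)."""
    i = 0
    while i < min(len(base), len(derived)) and base[i] == derived[i]:
        i += 1
    return base[i:], derived[i:]

rules = Counter(suffix_rule(b, d) for b, d in pairs)

def derive(word):
    """Apply the most frequent applicable rule to propose a derivative."""
    for (strip, append), _ in rules.most_common():
        if word.endswith(strip):
            return word[: len(word) - len(strip)] + append
    return None

print(derive("rysować"))   # expected: "rysowanie"
```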
We present an approach to the description of Polish multi-word expressions (MWEs) which is based on expressions in the WCCL language of morpho-syntactic constraints instead of grammar rules or transducers. For each MWE, its basic morphological form and the base forms of its constituents are specified; in addition, each MWE is assigned to a class on the basis of its syntactic structure. For each class a WCCL constraint is defined, parametrised by string variables referring to the base forms or inflected forms of MWE constituents. The constraint specifies a minimal set of conditions that must be fulfilled in order to recognise an occurrence of the given MWE in text with high accuracy. Our formalism is focused on the efficient description of large MWE lexicons for the needs of text processing. The formalism allows for a relatively easy representation of flexible word order and discontinuous constructions. Moreover, there is no need for a full specification of the MWE grammatical structure: only selected aspects of a particular MWE's structure need to be described, in a way that achieves the target recognition accuracy. On the basis of a set of simple heuristics, the WCCL-based representation of MWEs can be automatically generated from a list of MWE base forms. The proposed representation was applied on a practical scale to the description of a large set of Polish MWEs included in plWordNet.
Currently, research infrastructures are being designed and established in many disciplines, since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools, the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on the knowledge and work from many projects that were carried out during the last years and wants to build stable and robust services that can be used by researchers. Here, an important role will be played by service centres that have the potential to be persistent and that adhere to the criteria established by CLARIN. In the last year of the so-called preparatory phase, these centres have been developing four use cases that demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criterion of being cross-national.
A limited prototype of a CLARIN Language Technology Infrastructure (LTI) node is presented. The node prototype provides several types of web services for Polish. The functionality encompasses morpho-syntactic processing, shallow semantic processing of corpora on the basis of the SuperMatrix system, and plWordNet browsing. We take the prototype as the starting point for a discussion of the requirements that must be fulfilled by the LTI. Some possible solutions are proposed for less frequently discussed problems, e.g. streaming processing of language data on a remote processing node. We experimentally investigate how to tackle several of the many requirements discussed. Aspects such as processing large volumes of data, an asynchronous mode of processing and scalability of the architecture to a large number of users received special attention in the constructed prototype of the web service for morpho-syntactic processing of Polish called TaKIPI-WS (http://plwordnet.pwr.wroc.pl/clarin/ws/takipi/). TaKIPI-WS is a distributed system with a three-layer architecture, an asynchronous model of request handling and multi-agent-based processing. It consists of three layers: the WS Interface, the Database and the Daemons. The role of the Database is to store and exchange data between the Interface and the Daemons. The Daemons (i.e. taggers) are responsible for executing the requests queued in the database. Results of performance tests are also presented in the paper.
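The toy sketch below illustrates the asynchronous, queue-mediated request-handling pattern described above: the interface enqueues a tagging request and returns a ticket immediately, while a daemon worker consumes the queue and stores the result. An in-memory queue and dictionary stand in for the Database layer, and the "tagging" is faked; the real TaKIPI-WS uses a database, web-service interface and multi-agent daemons.

```python
import itertools
import queue
import threading

requests, results = queue.Queue(), {}
ticket_counter = itertools.count(1)

def submit(text):                       # "WS Interface": non-blocking submission
    ticket = next(ticket_counter)
    requests.put((ticket, text))
    return ticket                       # the client polls later with this ticket

def daemon():                           # "Daemon": processes queued requests
    while True:
        ticket, text = requests.get()
        results[ticket] = [(tok, "subst") for tok in text.split()]   # fake tagging
        requests.task_done()

threading.Thread(target=daemon, daemon=True).start()

t = submit("Ala ma kota")
requests.join()                         # wait until the queue has been processed
print(t, results[t])
```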
The construction of a wordnet, a labour-intensive enterprise, can be significantly assisted by the automatic grouping of lexical material and the discovery of lexical semantic relations. The objective is to ensure the high quality of automatically acquired results before they are presented for the lexicographer's approval. We discuss a software tool that suggests synset members using a measure of semantic relatedness with a given verb or adjective; this extends previous work on nominal synsets in the Polish WordNet. Syntactically motivated constraints are applied to a large morphologically annotated corpus of Polish. Evaluation was performed via the WordNet-Based Similarity Test and additionally supported by human raters. A lexicographer also manually assessed a suitable sample of suggestions. The results compare favourably with other known methods of acquiring semantic relations.