- Anthology ID:
- Wroclaw, Poland
- Global Wordnet Association
The schema.org initiative was designed to introduce machine readable metadata into the World Wide Web. This paper investigates conceptual biases in the schema through a mapping exercise between schema.org types and WordNet synsets. We create a mapping ontology which establishes the relationship between schema metadata types and the corresponding everyday concepts. This in turn can be used to enhance metadata annotation to include a more complete description of knowledge on the Web of data.
We present here the enhancement of the Romanian wordnet with a new type of information, very useful in language processing, namely types of verbal multi-word expressions. All verb literals made of two or more words are assigned a label specific to the type of verbal multi-word expression they correspond to. These labels were created in the PARSEME COST Action and were used in version 1.1 of the shared task it organized. The results of this annotation are compared to those obtained in the annotation of a Romanian news corpus with the same labels. Given the alignment of the Romanian wordnet to the Princeton WordNet, this type of annotation can be further used for drawing comparisons between equivalent verbal literals in various languages, provided that such information is annotated in the wordnets of the respective languages and their wordnets are aligned to Princeton WordNet, and thus to the Romanian wordnet.
In this paper we consider an approach to the verification of large lexical-semantic resources such as WordNet. The verification procedure is based on the analysis of discrepancies between corpus-based and thesaurus-based word similarities. We calculated such word similarities on the basis of a Russian news collection and the Russian wordnet (RuWordNet). We applied the procedure to more than 30 thousand words and found some serious errors in word sense description, including incorrect or absent relations and missing main senses of ambiguous words.
GermaNet (Henrich and Hinrichs, 2010; Hamp and Feldweg, 1997) is a comprehensive wordnet of Standard German as spoken in the Federal Republic of Germany. The GermaNet team aims at modelling the basic vocabulary of the language. German is an official language or a minority language in many countries. It is an official language in Austria, Germany and Switzerland, each with its own codified standard variety (Auer, 2014, p. 21), and also in Belgium, Liechtenstein, and Luxembourg. German is recognized as a minority language in thirteen additional countries, including Brazil, Italy, Poland, and Russia. However, the different standard varieties of German are currently not represented in GermaNet. With this project, we make a start on changing this by including one variety, namely Swiss Standard German, in GermaNet. This gives a more inclusive perspective on the German language. We argue that Swiss Standard German words, Helvetisms, are best included in the already existing wordnet GermaNet, rather than placed in a separate wordnet.
Wikidata introduced support for lexicographic data in 2018. Here we describe the lexicographic part of Wikidata as well as experiences with setting up lexemes for the Danish language. We note various possible annotations for lexemes as well as discuss various choices made.
Semantic information about entities, specifically how close in meaning two mentions are to each other, can be very useful for the task of co-reference resolution. Among the most well-researched and widely used ways of presenting this information are measures of semantic similarity and semantic relatedness. These metrics are often computed from the structure of a thesaurus, but it is also possible to use alternative resources. One such source is Wikipedia, which has a category structure similar to that of a thesaurus. In this work we describe an attempt to use semantic relatedness measures, calculated on thesaurus and Wikipedia data, to improve the quality of a co-reference resolution system for the Russian language. The results show that this is a viable solution and that combining the two sources yields the largest gain in quality.
Arabic WordNet (AWN) is one of the best-known lexical resources for the Arabic language. However, it contains various issues that affect its use in different Natural Language Processing (NLP) applications. Due to a lack of resources, updating Arabic WordNet requires much effort. There have been only two updates since it was first published in 2006. The most significant of these was in 2013, which represented a substantial development in the usability and coverage of Arabic WordNet. This paper provides a case study on the updates of Arabic WordNet and the development of its contents. More precisely, we present the new content in terms of relations that have been added to the extended version of Arabic WordNet. We also validate and evaluate its contents at different levels. We use its different versions in a Word Sense Disambiguation system. Finally, we compare the results and evaluate them. Results show that newly added semantic relations can improve the performance of a Word Sense Disambiguation system.
The paper presents an effort on transferability of noun–verb and noun–adjective derivative and semantic relations to noun-noun relations. The approach relies on information from semantic classes and existing inter-POS derivative and (morpho)semantic relations between noun and verb, and noun and adjective synsets. We have added semantic relations between nouns in WordNet that are indirectly linked via verbs and adjectives. Observations on the combination between the relations and semantic classes of nouns they link, may facilitate further efforts in assigning semantic properties to nouns pointing to their abilities to participate in predicate-argument structures.
In this paper we consider the procedure for linking the Russian wordnet (RuWordNet) to WordNet. The specificity of the procedure in our case stems from the fact that a lot of bilingual (Russian and English) lexical data have been gathered in another Russian thesaurus, RuThes, which has a different structure than WordNet. Previously, RuThes was semi-automatically transformed into RuWordNet, which has a WordNet-like structure. Now, the RuThes English data are utilized to establish a matching from the RuWordNet synsets to the WordNet synsets.
We describe how a natural language interface can be developed for a wordnet with a small set of handcrafted templates, leveraging sentence embeddings. The proposed approach does not use rules for parsing natural language queries, but experiments showed that the embeddings model is tolerant enough to correctly predict relation types for queries that do not match known patterns exactly. It was tested with OpenWordNet-PT, for which this method may provide an alternative interface, with benefits also for the curation process.
In this paper, we investigate mapping of the WORDNET hyponymy relation to feature vectors. Our aim is to model lexical knowledge in such a way that it can be used as input in generic machine-learning models, such as phrase entailment predictors. We propose two models. The first one leverages an existing mapping of words to feature vectors (fastText), and attempts to classify such vectors as within or outside of each class. The second model is fully supervised, using solely WORDNET as a ground truth. It maps each concept to an interval or a disjunction thereof. The first model approaches but does not quite attain state-of-the-art performance. The second model can achieve near-perfect accuracy.
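To illustrate why intervals are a natural target for hyponymy, the following is a minimal sketch of the classical depth-first interval numbering of a tree, in which "is a hyponym of" reduces to interval containment. It is not the supervised model described in the abstract; the toy taxonomy and the names `assign_intervals` and `is_hyponym` are assumptions made for illustration. A concept with several hypernyms would not fit in a single interval under this scheme, which is one plausible reading of why a disjunction of intervals can be needed.

```python
# Hypothetical toy taxonomy: parent -> list of direct hyponyms.
TAXONOMY = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": ["tree"],
}


def assign_intervals(tree, root):
    """Give every node a half-open interval [start, end) via depth-first numbering.

    A node's interval encloses the intervals of all of its descendants.
    """
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        intervals[node] = (start, counter)

    visit(root)
    return intervals


def is_hyponym(intervals, a, b):
    """True if a equals b or a is a (transitive) hyponym of b."""
    a_start, a_end = intervals[a]
    b_start, b_end = intervals[b]
    return b_start <= a_start and a_end <= b_end


INTERVALS = assign_intervals(TAXONOMY, "entity")
```

With these intervals, `is_hyponym(INTERVALS, "dog", "animal")` holds while `is_hyponym(INTERVALS, "dog", "plant")` does not, using two integer comparisons per query.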
This paper proposes a framework for investigating which types of semantic properties are represented by distributional data. The core of our framework consists of relations between concepts and properties. We provide hypotheses on which properties are reflected in distributional data or not based on the type of relation. We outline strategies for creating a dataset of positive and negative examples for various semantic properties, which cannot easily be separated on the basis of general similarity (e.g. fly: seagull, penguin). This way, a distributional model can only distinguish between positive and negative examples through evidence for a target property. Once completed, this dataset can be used to test our hypotheses and work towards data-derived interpretable representations.
In this paper we discuss how Walenty is using PLWORDNET to represent semantic information. We decided to use PLWORDNET lexical units and synsets to describe both the predicate meaning and the semantic fields of its arguments. The original design decision required some further refinement caused by the structure of PLWORDNET and complex relations between arguments.
In this article, we tackle the issue of the limited quantity of manually sense annotated corpora for the task of word sense disambiguation, by exploiting the semantic relationships between senses such as synonymy, hypernymy and hyponymy, in order to compress the sense vocabulary of Princeton WordNet, and thus reduce the number of different sense tags that must be observed to disambiguate all words of the lexical database. We propose two different methods that greatly reduce the size of neural WSD models, with the benefit of improving their coverage without additional training data, and without impacting their precision. In addition to our methods, we present a WSD system which relies on pre-trained BERT word vectors in order to achieve results that significantly outperform the state of the art on all WSD evaluation tasks.
We propose a new algorithm for word sense disambiguation, exploiting data from a WordNet with many types of lexical relations, such as plWordNet for Polish. In this method, sense probabilities in context are approximated with a language model. To estimate the likelihood of a sense appearing amidst the word sequence, the token being disambiguated is substituted with words related lexically to the given sense or words appearing in its WordNet gloss. We test this approach on a set of sense-annotated Polish sentences with a number of neural language models. Our best setup achieves the accuracy score of 55.12% (72.02% when first senses are excluded), up from 51.77% of an existing PageRank-based method. While not exceeding the first (often meaning most frequent) sense baseline in the standard case, this encourages further research on combining WordNet data with neural models.
In this paper we describe the merge of the Danish wordnet, DanNet, with Princeton WordNet, applying a two-step approach. We first link from the English Princeton core to Danish (5,000 base concepts) and then proceed to linking the rest of the Danish vocabulary to English, thus going from Danish to English. Since the Danish wordnet is built bottom-up from Danish lexica and corpora, all taxonomies are monolingually based and thus not necessarily directly compatible with the coverage and structure of the Princeton WordNet. This poses some challenges to the linking procedure, since a considerable number of the links cannot be realised via the preferred cross-language synonym link, which implies a more or less precise correlation between the two concepts. Instead, a subset of the links are realised through near-synonym or hyponymy links to compensate for the fact that no precise translation can be found in the target resource. The tool WordnetLoom is currently used for manual linking, but options for a more automatic procedure in the future are discussed. We conclude that the two resources differ from each other considerably more than expected, both in vocabulary and in structure.
Stemming is a technique that reduces an inflected word to its root form. Assamese is a morphologically rich, scheduled Indian language. Various suffixes are applied to a word in various contexts. Normalizing such inflected words will help improve the performance of various Natural Language Processing applications. This paper develops a look-up and rule-based suffix-stripping approach for the Assamese language using WordNet. The authors prepare a dictionary of root words extracted from the Assamese WordNet and from named entities. Appropriate stemming rules for inflected nouns and verbs were added to the rule engine, and the stemmed output was then tested against the morphological root words of the Assamese WordNet and the named entities by computing the Hamming distance. The resulting stemmer for the Assamese language achieves an accuracy of 85%. The authors also report the performance of an IR system that applies the Assamese stemmer, and demonstrate its efficiency by retrieving sense-oriented results for the submitted query. The morphological analyzer is thus expected to open the way for developing various Assamese NLP applications.
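The look-up and suffix-stripping idea can be sketched roughly as follows. This is a minimal illustration, not the authors' rule engine: the root list and suffix rules are English stand-ins invented for readability (no Assamese rules are given in the abstract), and accepting a stripped form only when it appears in the root dictionary is an assumption.

```python
# Hypothetical stand-in data; the real system uses Assamese WordNet roots and rules.
ROOTS = {"walk", "book"}
SUFFIXES = ["s", "ed", "ing"]


def stem(word, roots=ROOTS, suffixes=SUFFIXES):
    """Return the dictionary root of `word`, or `word` unchanged if none is found."""
    # Look-up step: the word is already a known root.
    if word in roots:
        return word
    # Rule step: try the longest suffix first and accept only known roots.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)]
            if candidate in roots:
                return candidate
    return word


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

A stemmed output can then be checked against a gold root, for example `hamming(stem("walking"), "walk")`, where a distance of zero means the stemmer recovered the expected root.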
Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find that the distributions in the pseudo-corpora exhibit properties found in natural corpora, such as Zipf’s and Heaps’ law, and also observe that the proportion of rare words in a pseudo-corpus affects the performance of its embeddings on word similarity.
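A pseudo-corpus of the kind described can be generated along the following lines. This is a minimal sketch under stated assumptions: the toy taxonomy, the uniform choice of start node and next step, the fixed sentence length, and the function names are all illustrative and are not taken from the paper.

```python
import random

# Hypothetical toy taxonomy: parent -> direct hyponyms (every node is a key).
TAXONOMY = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": ["tree"],
    "dog": [],
    "cat": [],
    "tree": [],
}


def build_neighbours(tree):
    """Turn the parent -> children map into an undirected adjacency list."""
    neighbours = {node: set() for node in tree}
    for parent, children in tree.items():
        for child in children:
            neighbours[parent].add(child)
            neighbours[child].add(parent)
    return {node: sorted(adjacent) for node, adjacent in neighbours.items()}


def random_walk(neighbours, start, length, rng):
    """Walk `length` nodes, moving to a uniformly chosen neighbour each step."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(neighbours[walk[-1]]))
    return walk


def pseudo_corpus(tree, n_sentences, length, seed=0):
    """Generate `n_sentences` walks; each walk serves as one pseudo-sentence."""
    rng = random.Random(seed)
    neighbours = build_neighbours(tree)
    nodes = sorted(neighbours)
    return [
        random_walk(neighbours, rng.choice(nodes), length, rng)
        for _ in range(n_sentences)
    ]
```

The resulting list of token lists can be passed to any embedding trainer that accepts tokenised sentences, and token frequencies can be counted directly to inspect how closely the distribution follows Zipf's law.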
Constructing semantic relations in WordNet has been a labour-intensive task, especially in a dynamic and fast-changing language environment. Building on recent advances in contextualized embeddings, this paper proposes the concept of morphology-guided sense vectors, which can be used to semi-automatically augment semantic relations in Chinese Wordnet (CWN). This paper (1) builds sense vectors with pre-trained contextualized embedding models; (2) demonstrates that the computed sense vectors are consistent with the sense distinctions made in CWN; and (3) predicts potential semantically related sense pairs with high accuracy using the sense vector model.
AutoExtend is a method for learning unambiguous vector embeddings for word senses. We visualise these word embeddings with t-SNE, which further compresses the vectors to the x,y plane. We show that the t-SNE co-ordinates can be used to reveal interesting semantic relations between word senses, and propose a new method that uses the simple x,y coordinates to compute semantic similarity. This can be used to propose new links and alterations to existing ones in WordNet. We plan to add this approach to the existing toolbox of methods in an attempt to understand learned semantic relations in word embeddings.
With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet, which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our ongoing and future work.
Within a larger frame of facilitating human-robot interaction, we present here the creation of a core vocabulary to be learned by a robot. It is extracted from two tokenised and lemmatized scenarios pertaining to two imagined microworlds in which the robot is supposed to play an assistive role. We also evaluate two resources for their utility for expanding this vocabulary so as to better cope with the robot’s communication needs. The language under study is Romanian and the resources used are the Romanian wordnet and word embedding vectors extracted from the large representative corpus of contemporary Romanian, CoRoLa. The evaluation is made for two situations: one in which the words are not semantically disambiguated before expanding the lexicon, and another one in which they are disambiguated with senses from the Romanian wordnet. The appropriateness of each resource is discussed.
This paper describes our project on Japanese compound verbs. Japanese “Verb (adnominal form) + Verb” compounds, which are treated as single verbs, frequently appear in daily communication. They are not sufficiently registered in Japanese dictionaries or thesauri. We are now compiling a list of the synonymous expressions of compound verbs in the “compound verb lexicon” built by the National Institute of Japanese Language and Linguistics. We extracted synonymous words and phrases of compound verbs from five hundred million Japanese web corpora. As a result, synonymous expressions of 1800 compound verbs were obtained automatically among the 2700 in the “compound verb lexicon”. From our data, we observed that some compound verbs represent not only motion but also additional nuances such as an emotional one. In order to reflect the abundant meanings that compound verbs carry, we plan to link the synonymous expressions to the Japanese wordnet. Concretely, in the case of synonymous phrases, we try to link the adverbial expressions that form part of a phrase to the corresponding adverbial synset in the Japanese wordnet.
The African Wordnet Project (AWN) includes all nine indigenous South African languages, namely isiZulu, isiXhosa, Setswana, Sesotho sa Leboa, Tshivenda, Siswati, Sesotho, isiNdebele and Xitsonga. The AWN currently includes 61 000 synsets as well as definitions and usage examples for a large part of the synsets. The project recently received extended funding from the South African Centre for Digital Language Resources (SADiLaR) and aims to update all aspects of the current resource, including the seed list used for new development, software tools used and mapping the AWN to the latest version of PWN 3.1. As with any resource development project, it is essential to also include phases of focused quality assurance and updating of the basis on which the resource is built. The African languages remain under-resourced. This paper describes progress made in the development of the AWN as well as recent technical improvements.
We describe a detailed analysis of a sample of a large benchmark of commonsense reasoning problems that has been automatically obtained from WordNet, SUMO and their mapping. The objective is to provide a better assessment of the quality of both the benchmark and the involved knowledge resources for advanced commonsense reasoning tasks. By means of this analysis, we are able to detect some knowledge misalignments, mapping errors and gaps in knowledge and resources. Our final objective is the extraction of some guidelines towards a better exploitation of this commonsense knowledge framework through the improvement of the included resources.
This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Hong Kong Cantonese. It is built using the expansion approach, leveraging the existing Chinese Open Wordnet and the Princeton Wordnet’s semantic hierarchy. The main goal of our project was to produce a high-quality, human-curated resource, and this paper reports on the initial efforts and steady progress of our building method. It is our belief that the lexical data made available by this wordnet, including Jyutping romanization, will be useful for a variety of future uses, including many language processing tasks and linguistic research on Cantonese and its interactions with other Chinese dialects.
Event detection is an important NLP task that has been only recently tackled in the context of Polish, mostly due to lack of language resources. The available annotated corpora are still relatively small and supervised learning approaches are limited by the size of training datasets. Event detection tools are very much needed, as they can be used to annotate more language resources automatically and to improve the accuracy of other NLP tasks, which rely on the detection of events, such as question answering or machine translation. In this paper we present a deep learning based approach to this task, which proved to capture the knowledge contained in the training data most effectively and outperform previously proposed methods. We show a direct comparison to previously published results, using the same data and experimental setup.
When teaching language for specific purposes (LSP), linguistic resources are needed to help students understand and write specialised texts. As building a lexical resource is costly, we explore the use of wordnets to represent the terms that can be found in particular textual domains. In order to gather the terms to be included in wordnets, we propose a textual genre approach, which leads us to introduce a new relation, “term used in”, to link all the possible terms/synsets that can appear in a text to the synset of the textual genre. This way, students can use the wordnet as a dictionary or thesaurus when writing specialised texts. We explain our approach by means of logbooks and terms in Basque. A side effect of this work is the enrichment of the wordnets with new variants and synsets.
We fit WordNet relations to word embeddings, using 3CosAvg and LRCos, two set-based methods for analogy resolution, and introduce 3CosWeight, a new, weighted variant of 3CosAvg. We test the performance of the resulting semantic vectors in lexicographic semantics tests, and show that none of the tested classifiers can learn symmetric relations like synonymy and antonymy, since the source and target words of these relations are the same set. By contrast, with the asymmetric relations (hyperonymy / hyponymy and meronymy), both 3CosAvg and LRCos clearly outperform the baseline in all cases, while 3CosWeight attained the best scores with hyponymy and meronymy, suggesting that this new method could provide a useful alternative to previous approaches.
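As a reminder of how the set-based baseline operates, here is a minimal sketch of 3CosAvg as it is usually described: the offset between the mean target vector and the mean source vector is added to the query vector, and the vocabulary word closest by cosine similarity is returned. The three-dimensional toy vectors and the function names are assumptions for illustration; LRCos and the proposed 3CosWeight are not shown.

```python
import math

# Hypothetical toy embeddings: the third dimension loosely encodes "generality".
VOCAB = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.9, 0.1, 0.0],
    "animal": [1.0, 0.0, 1.0],
    "oak": [0.0, 1.0, 0.0],
    "pine": [0.1, 0.9, 0.0],
    "plant": [0.0, 1.0, 1.0],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def three_cos_avg(query, pairs, vocab):
    """Predict the target word for `query` from (source, target) example pairs."""
    source_mean = mean_vector([vocab[s] for s, _ in pairs])
    target_mean = mean_vector([vocab[t] for _, t in pairs])
    offset = [t - s for s, t in zip(source_mean, target_mean)]
    probe = [q + o for q, o in zip(vocab[query], offset)]
    candidates = [word for word in vocab if word != query]
    return max(candidates, key=lambda word: cosine(vocab[word], probe))


HYPERNYM_PAIRS = [("dog", "animal"), ("oak", "plant")]
```

Calling `three_cos_avg("cat", HYPERNYM_PAIRS, VOCAB)` on this toy data selects `animal`. For a symmetric relation the source and target sets coincide, so the averaged offset collapses towards the zero vector, which matches the abstract's observation about synonymy and antonymy.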
This paper presents the Mongolian Wordnet (MOW), and a general methodology of how to construct it from various sources, e.g. lexical resources and expert translations. As of today, the MOW contains 23,665 synsets, 26,875 words, 2,979 glosses, and 213 examples. The manual evaluation of the resource estimated its quality at 96.4%.
We describe the release of a new wordnet for English based on the Princeton WordNet, but now developed under an open-source model. In particular, this version of WordNet, which we call English WordNet 2019, has been developed by multiple people around the world through GitHub and fixes many errors in previous wordnets for English. We give some details of the changes that have been made in this version and offer some perspectives on likely future changes as this project continues to evolve.
An effective conversion method was proposed in the literature to obtain a lexical semantic space from a lexical semantic graph, thus permitting to obtain WordNet embeddings from WordNets. In this paper, we propose the exploitation of this conversion methodology as the basis for the comparative assessment of WordNets: given two WordNets, their relative quality in terms of capturing the lexical semantics of a given language, can be assessed by (i) converting each WordNet into the corresponding semantic space (i.e. into WordNet embeddings), (ii) evaluating the resulting WordNet embeddings under the typical semantic similarity prediction task used to evaluate word embeddings in general; and (iii) comparing the performance in that task of the two word embeddings, extracted from the two WordNets. A better performance in that evaluation task results from the word embeddings that are better at capturing the semantic similarity of words, which, in turn, result from the WordNet that is of higher quality at capturing the semantics of words.
WordNets have been used in a wide variety of applications, including the design and development of intelligent and human-assisting systems. Although WordNet was initially developed as an online lexical database (Miller, 1995; Fellbaum, 1998), later developments have inspired the use of the WordNet database as a resource in NLP applications and language technology development, and as a source of structured learning materials. This paper proposes, conceptualizes, designs, and develops a voice-enabled information retrieval system that presents WordNet knowledge in spoken form in response to a spoken query. In practice, the work converts the WordNet resource into a structured voice-based knowledge extraction system: a spoken query is processed in a pipeline, the relevant WordNet resources are extracted and structured through a second pipeline, and the result is presented in spoken form. The system thus provides a speech interface to the existing WordNet, and we named it “Spoken WordNet”. The system offers two interfaces, one designed and developed for the Web and the other an app interface for smartphones. It can also be seen as a restructuring of WordNet into a version friendly to visually challenged users. The user can input a query string in the form of a spoken English sentence or word. Jaccard similarity is calculated between the input sentence and the synset definitions. The synset with the highest similarity score is taken as the synset of interest among the multiple available synsets. The user is also prompted to choose a contextual synset in case of ambiguity.
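The synset selection step can be sketched as below. This is a minimal illustration assuming a simple lowercase word tokenizer and a hand-written dictionary of glosses; the synset identifiers, the glosses, and the function names are hypothetical, and the speech recognition and synthesis stages of the system are not covered.

```python
import re

# Hypothetical synset glosses standing in for WordNet definitions.
DEFINITIONS = {
    "bank.n.01": "sloping land beside a body of water",
    "bank.n.02": "a financial institution that accepts deposits and lends money",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_synset(query, definitions):
    """Return (synset_id, score) for the gloss most similar to the query."""
    query_tokens = tokenize(query)
    scored = [
        (synset_id, jaccard(query_tokens, tokenize(gloss)))
        for synset_id, gloss in definitions.items()
    ]
    return max(scored, key=lambda pair: pair[1])
```

For the query "institution that lends money" this picks `bank.n.02`, since it shares four tokens with that gloss and none with the other. When several synsets tie or all scores are low, falling back to asking the user, as the abstract describes, is the natural next step.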
In this paper we describe our current work on representing a recently created German lexical semantics resource in OntoLex-Lemon and in conformance with WordNet specifications. Besides presenting the representation effort, we show the utilization of OntoLex-Lemon to bridge from WordNet-like resources to full lexical descriptions and extend the coverage of WordNets to other types of lexical data, such as decomposition results, exemplified for German data, and inflectional phenomena, here outlined for English data.
In this paper, we present semi-automatic annotation of the Event Structure Frames to synsets of English verbs in WordNet. The Event Structure Frame is a sub-eventual structure frame which combines event structure (lexical aspect) with argument structure represented by semantic roles and opposition structure which represents the presupposed and entailed sub-events of a matrix event. Our annotation work is done semi-automatically by GESL-based automatic annotation and manual error-correction. GESL is an automatic annotation tool of the Event Structure Frame to verbs in a sentence. We apply GESL to the example sentence given for each synset of a verb in WordNet. We expect that our work will make WordNet much more useful for any NLP and its applications which require lexical semantic information of English verbs.
The paper presents current efforts towards linking two large lexical semantic resources – WordNet and FrameNet – to the end of their mutual enrichment and the facilitation of the access, extraction and analysis of various types of semantic and syntactic information. In the second part of the paper, we go on to examine the relation of inheritance and other semantic relations as represented in WordNet and FrameNet and how they correspond to each other when the resources are aligned. We discuss the implications with respect to the enhancement of the two resources through the definition of new relations and the detailisation of conceptual frames.
The paper reports on ongoing work that manually maps the Bulgarian WordNet BTB-WN to the Bulgarian Wikipedia. The preparatory work of extracting the Wikipedia articles and provisionally relating them to the WordNet lemmas was done automatically. The manual work includes checking the corresponding senses in both resources as well as the missing ones. The main cases of mapping are considered. The first experiments, mapping about 1000 synsets, establish more than 78% exact correspondences and nearly 15% new synsets.
This paper reports our efforts in constructing a sense-labeled English-Turkish parallel corpus using the traditional method of manual tagging. We tagged a pre-built parallel treebank which was translated from the Penn Treebank corpus. This approach allowed us to generate a resource combining syntactic and semantic information. We provide statistics about the corpus itself as well as information regarding its development process.
Given that verbs play a crucial role in language comprehension, this paper presents a study which compares the verb senses in English PropBank with the ones in English WordNet through manual tagging. After analyzing 1554 senses in 1453 distinct verbs, we found that while the majority of the senses in PropBank have one-to-one correspondents in WordNet, a substantial number of them are differentiated. Furthermore, by analysing the differences between our manually tagged resource and an automatically tagged one, we claim that manual tagging can help provide better results in sense annotation.
We discuss the creation of ASLNet by aligning the Princeton WordNet (PWN) with SignStudy, an online database of American Sign Language (ASL) signs. This alignment will have many immediate benefits for first and second-sign language learners as well as ASL researchers by highlighting semantic relations among signs. We begin to address the interesting theoretical question of to what extent the wordnet-style organization of the English lexicon (and those of wordnets in other spoken languages) is applicable to ASL, and whether ASL requires positing additional, language or modality-specific relations among signs. Significantly, the mapping of SignStudy and PWN provides a bridge between ASL and the worldwide wordnet community, which comprises speakers of dozens of languages working in academic and language technology settings.
In the paper, we study the case of building a keywords database related to the Polish Classification of Activities (PKD 2007). The database enables automatic classification of the companies to the industry branches. The classification is performed based on the company’s activity description. We present the initial design of the keywords database and the ways in which wordnets were used to enrich it. Finally, we present the preliminary statistical evaluation of the produced resource.
In this paper we present a novel method for emotive propagation in a wordnet based on a large emotive seed. We introduce a sense-level emotive lexicon annotated with polarity, arousal and emotions. The data were annotated as a part of a large study involving over 20,000 participants. A total of 30,000 lexical units in Polish WordNet were described with metadata; each unit received about 50 annotations concerning polarity, arousal and 8 basic emotions, marked on a multilevel scale. We present a preliminary approach to propagating emotive metadata to unlabeled lexical units based on the distribution of manual annotations, using logistic regression and a description of mixed synset embeddings based on our Heterogeneous Structured Synset Embeddings.
According to George K. Zipf, more frequent words have more senses. We have tested this law using corpora and wordnets of English, Spanish, Portuguese, French, Polish, Japanese, Indonesian and Chinese. We have shown that the law works reasonably well for all of these languages if we take, as Zipf did, mean values of meaning count and averaged ranks. On the other hand, the law fails badly at predicting the number of senses for a single lemma. We have also provided evidence that the slope coefficients of the Zipfian log-log linear model may vary from language to language.
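The slope coefficient of a log-log linear model can be estimated with ordinary least squares on the logarithms of both variables. The sketch below assumes that the regression is of log mean sense count on log averaged frequency rank, which is one reading of the abstract rather than a confirmed detail; the synthetic data and the function name `loglog_fit` are illustrative only.

```python
import math


def loglog_fit(ranks, mean_senses):
    """Fit log(mean_senses) = intercept + slope * log(rank) by least squares."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(m) for m in mean_senses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data that follows senses = 10 * rank ** -0.5 exactly (not real measurements).
RANKS = [1, 10, 100, 1000, 10000]
MEAN_SENSES = [10 * r ** -0.5 for r in RANKS]
SLOPE, INTERCEPT = loglog_fit(RANKS, MEAN_SENSES)
```

On this synthetic input the fit recovers the exponent used to generate it, up to floating-point error. Running the same fit per language on binned data is one simple way to compare slope coefficients across languages.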
The paper presents the latest release of the Polish WordNet, namely plWordNet 4.1. The most significant developments since version 3.0 include new relations for nouns and verbs, the mapping of semantic role relations from the valency lexicon Walenty onto the plWordNet structure, and sense-level inter-lingual mapping. Several statistics are presented to illustrate the development and current state of the wordnet.
In this paper, we compare a variety of sense-tagged sentiment resources, including SentiWordNet, ML-Senticon, plWordNet emo and the NTU Multilingual Corpus. The goal is to investigate the quality of the resources and see how well the sentiment polarity annotation maps across languages.
Lexical resources need to be as complete as possible. Very little work seems to have been done on adverbs, the smallest part-of-speech class in Princeton WordNet by number of synsets. Amongst adverbs, manner adverbs ending in ‘-ly’ seem the easiest to work with, as their meaning is almost the same as that of the associated adjective. This phenomenon has a parallel in Portuguese, where such manner adverbs end in the suffix ‘-mente’. We use this correspondence to improve the coverage of adverbs in the lexical resource OpenWordNet-PT, a wordnet for Portuguese.
In the Princeton WordNet Gloss Corpus, the word forms from the definitions (“glosses”) in WordNet’s synsets are manually linked to the context-appropriate sense in WordNet. The glosses then become a sense-disambiguated corpus annotated against WordNet version 3.0. The result is also called a semantic concordance, which can be seen as both a lexicon (WordNet extension) and an annotated corpus. In this work we motivate and present the initial steps to complete the annotation of all open-class words in this corpus. Finally, we introduce a freely available annotation interface built as an Emacs extension, and evaluate a preliminary annotation effort.
This paper introduces a new multilingual lexicon of geographical place names. The names are based on (and linked to) the GeoNames collection. Each location is treated as a new synset, which is linked by instance_hypernym to a small set of supertypes. These supertypes are linked to the collaborative interlingual index, based on mappings from GeoDomainWordnet. If a location is already in the interlingual index, then it is also linked to that entry, using mappings from the Geo-Wordnet. Finally, if GeoNames places the location in a larger location, this is linked using the mero_location link. Wordnets can be built for any language in GeoNames; we give results for those wordnets in the Open Multilingual Wordnet. We discuss how the lexicon is mapped and the characteristics of the extracted wordnets.
This paper studies auto-hyponymy and auto-troponymy relations (or vertical polysemy) in 11 wordnets uploaded to the new Open Multilingual Wordnet (OMW) webpage. We investigate how vertical polysemy forms polysemy structures (or sense clusters) in the semantic hierarchies of the wordnets. Our main findings are new polysemy structures that have not previously been associated with vertical polysemy, along with some inconsistencies in the analysis of semantic relations in the studied wordnets. In the case study, we turn attention to polysemy structures in the Estonian Wordnet (version 2.2.0), analyzing them and giving the lexicographers comments. In addition, we describe the algorithm for detecting polysemy structures and give an overview of the state of polysemy structures in the 11 wordnets.
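As a rough illustration of what detecting vertical polysemy involves, the sketch below flags a lemma that occurs both in a synset and in one of its direct hypernyms. This is only a minimal reading of auto-hyponymy, not the detection algorithm described in the paper; the function name `find_auto_hyponymy`, the dictionary layout, and the synset identifiers are all hypothetical.

```python
def find_auto_hyponymy(synset_lemmas, hypernyms):
    """Return sorted (lemma, hyponym_id, hypernym_id) triples.

    synset_lemmas: dict mapping synset id -> set of lemmas
    hypernyms:     dict mapping synset id -> list of direct hypernym ids
    A triple is reported when the same lemma appears in both synsets.
    """
    hits = []
    for child, parents in hypernyms.items():
        child_lemmas = synset_lemmas.get(child, set())
        for parent in parents:
            shared = child_lemmas & synset_lemmas.get(parent, set())
            for lemma in shared:
                hits.append((lemma, child, parent))
    return sorted(hits)

# Toy hierarchy with made-up identifiers (not real wordnet ids).
toy_lemmas = {
    "consume.v.01": {"consume", "ingest"},
    "drink.v.01": {"drink", "imbibe"},   # take in liquid
    "drink.v.02": {"drink", "booze"},    # take in alcohol
}
toy_hypernyms = {
    "drink.v.01": ["consume.v.01"],
    "drink.v.02": ["drink.v.01"],
}
found = find_auto_hyponymy(toy_lemmas, toy_hypernyms)
```

In the toy data only "drink" is shared between a synset and its hypernym, so it is the single reported case. A fuller treatment would also need to follow indirect hypernyms and handle troponymy for verbs separately.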
Automatic Cognate Detection (ACD) is a challenging task that has been used to support NLP applications such as Machine Translation, Information Retrieval and Computational Phylogenetics. Unidentified cognate pairs can pose a challenge to these applications and degrade their performance. In this paper, we detect cognate word pairs between Hindi and ten Indian languages and use deep learning methodologies to predict whether a word pair is cognate or not. We identify IndoWordnet as a potential resource for detecting cognate word pairs with orthographic similarity-based methods and train neural network models on the data obtained from it. We identify parallel corpora as another potential resource and perform the same experiments on them. We also validate the contribution of wordnets through further experimentation and report improved performance of up to 26%. We discuss the nuances of cognate detection among closely related Indian languages and release the lists of detected cognates as a dataset. We also observe the behaviour of Indian language pairs that are, to an extent, unrelated, and release the lists of detected cognates among them as well.
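One common orthographic similarity measure is normalized edit distance, sketched below. The abstract does not say which measures were used, so this is an assumption for illustration only; the function names and the 0.7 threshold are hypothetical, and the romanized word pair in the usage example is a toy input, not data from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def orthographic_similarity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest

def is_cognate_candidate(a, b, threshold=0.7):
    """Flag a word pair as a candidate when similarity reaches the threshold."""
    return orthographic_similarity(a, b) >= threshold

# Toy romanized pair; real input would be wordnet entries in native scripts.
candidate = is_cognate_candidate("naam", "nam")
```

Pairs flagged this way would serve only as candidate training data; the classification itself is done by the neural models the abstract describes.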