Proceedings of the 12th Global Wordnet Conference
German Rigau | Francis Bond | Alexandre Rademaker
Probing Taxonomic and Thematic Embeddings for Taxonomic Information
Filip Klubička | John Kelleher
Modelling taxonomic and thematic relatedness is important for building AI with comprehensive natural language understanding. The goal of this paper is to learn more about how taxonomic information is structurally encoded in embeddings. To do this, we design a new hypernym-hyponym probing task and perform a comparative probing study of taxonomic and thematic SGNS and GloVe embeddings. Our experiments indicate that both types of embeddings encode some taxonomic information, but the amount and the geometric properties of the encodings are independently related to both the encoder architecture and the embedding training data. Specifically, we find that only taxonomic embeddings carry taxonomic information in their norm, which is determined by the underlying distribution in the data.
A WordNet View on Crosslingual Transformers
Wondimagegnhue Tufa | Lisa Beinborn | Piek Vossen
WordNet is a database that represents relations between words and concepts as an abstraction of the contexts in which words are used. Contextualized language models represent words in contexts but leave the underlying concepts implicit. In this paper, we investigate how different layers of a pre-trained language model shape the abstract lexical relationship toward the actual contextual concept. Can we define the amount of contextualized concept forming needed given the abstracted representation of a word? Specifically, we consider samples of words with different polysemy profiles shared across three languages, assuming that words with a different polysemy profile require a different degree of concept shaping by context. We conduct probing experiments to investigate the impact of prior polysemy profiles on the representation in different layers. We analyze how contextualized models can approximate meaning through context and examine crosslingual interference effects.
What to Make of make? Sense Distinctions for Light Verbs
Julie Kallini | Christiane Fellbaum
Verbs like make, have and get present challenges for applications requiring automatic word sense discrimination. These verbs are both highly frequent and polysemous, with semantically “full” readings, as in make dinner, and “light” readings, as in make a request. Lexical resources like WordNet encode dozens of senses, making discrimination difficult and inviting proposals for reducing the number of entries or grouping them into coarser-grained supersenses. We propose a data-driven, linguistically-based approach to establishing a motivated sense inventory, focusing on make to establish a proof of concept. From several large, syntactically annotated corpora, we extract nouns that are complements of the verb make, and group them into clusters based on their Word2Vec semantic vectors. We manually inspect, for each cluster, the words with vectors closest to the centroid as well as a random sample of words within the cluster. The results show that the clusters reflect an intuitively plausible sense discrimination of make. As an evaluation, we test whether words within a given cluster cooccur in coordination phrases, such as apples and oranges, as prior work has shown that such conjoined nouns are semantically related. Conversely, noun complements from different clusters are less likely to be conjoined. Thus, coordination provides a similarity metric independent of the contextual embeddings used for clustering. Our results pave the way for a WordNet sense inventory that, while not inconsistent with the present one, would reduce it significantly and hold promise for improved automatic word sense discrimination.
Towards Effective Correction Methods Using WordNet Meronymy Relations
Javier Álvez | Itziar Gonzalez-Dios | German Rigau
In this paper, we analyse and compare several methods for correcting knowledge resources, with the purpose of improving the abilities of systems that require commonsense reasoning with the least possible human effort. To this end, we cross-check the WordNet meronymy relation member against the knowledge encoded in a SUMO-based first-order logic ontology on the basis of the mapping between WordNet and SUMO. In particular, we focus on the knowledge in WordNet regarding the taxonomy of animals and plants. Despite being created manually, these knowledge resources (WordNet, SUMO and their mapping) are not free of errors and discrepancies. Thus, we propose three correction methods: semi-automatically improving the alignment between WordNet and SUMO, performing a few corrections in SUMO, and combining the two strategies. The evaluation of each method includes the required human effort and the achieved improvement on unseen data from the WebChild project, which is tested using first-order logic automated theorem provers.
On the Acquisition of WordNet Relations in Portuguese from Pretrained Masked Language Models
Hugo Gonçalo Oliveira
This paper studies the application of pretrained BERT in the acquisition of synonyms, antonyms, hypernyms and hyponyms in Portuguese. Masked patterns indicating those relations were compiled with the help of a service for validating semantic relations, and then used for prompting three pretrained BERT models, one multilingual and two for Portuguese (base and large). Predictions for the masks were evaluated in two different test sets. Results achieved by the monolingual models are interesting enough for considering these models as a source for enriching wordnets, especially when predicting hypernyms of nouns. Previously reported performances on prediction were improved with new patterns and with the large model. When it comes to selecting the related word from a set of four options, performance is even better, but not enough for outperforming the selection of the most similar word, as computed with static word embeddings.
Wordnet for Definition Augmentation with Encoder-Decoder Architecture
Konrad Wojtasik | Arkadiusz Janz | Maciej Piasecki
Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be applied relatively easily in other domains, like insertion, deletion or substitution, mostly result in changing the sentence meaning significantly and obtaining an incorrect example. Wordnets are potentially a perfect source of rich and high-quality data that, when integrated with the powerful capacity of generative models, can help to solve this complex task. In this work, we use plWordNet, a wordnet of the Polish language, to explore the capability of encoder-decoder architectures in data augmentation of sense glosses. We discuss the limitations of generative methods and perform a qualitative review of generated data samples.
Data Augmentation Method for Boosting Multilingual Word Sense Disambiguation
Arkadiusz Janz | Marek Maziarz
Recent advances in Word Sense Disambiguation suggest neural language models can be successfully improved by incorporating knowledge base structure. Models of this class are called hybrid solutions. We propose a method of improving hybrid WSD models by harnessing data augmentation techniques and bilingual training. The data augmentation consists of structure augmentation using interlingual connections between wordnets and text data augmentation based on multilingual glosses and usage examples. We utilise a language-agnostic neural model trained both with SemCor and Princeton WordNet gloss and example corpora, as well as with Polish WordNet glosses and usage examples. This augmentation technique makes a well-known hybrid WSD architecture competitive with current state-of-the-art models, including more complex ones.
Mapping Wordnets on the Fly with Permanent Sense Keys
Eric Kafe
Most of the major databases on the semantic web have links to Princeton WordNet (PWN) synonym set (synset) identifiers, which differ for each PWN release, and are thus incompatible between versions. On the other hand, both PWN and the more recent Open English Wordnet (OEWN) provide permanent word sense identifiers (the sense keys), which can solve this interoperability problem. We present an algorithm that runs in linear time to automatically derive a synset mapping between any pair of Wordnet versions that use PWN sense keys. This makes it possible to update old WordNet links and to interoperate seamlessly with newer English Wordnet versions for which no prior mapping exists. By applying the proposed algorithm on the fly, at load time, we combine the Open Multilingual Wordnet (OMW 1.4, which uses old PWN 3.0 identifiers) with OEWN Edition 2021, and obtain almost perfect precision and recall. We compare the results of our approach using synset offsets versus the Collaborative InterLingual Index (CILI version 1.0) as synset identifiers, and find that the synset offsets perform better than CILI 1.0 in all cases, except a few ties.
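The general idea of mapping synsets through shared sense keys can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name `map_synsets`, the dictionary input format, and the majority-vote rule for split synsets are hypothetical, and the synset identifiers are made-up toy values.

```python
from collections import Counter, defaultdict


def map_synsets(old_keys, new_keys):
    """Map old synset ids to new synset ids through shared sense keys.

    old_keys / new_keys: dict of sense key -> synset id for each version
    (a hypothetical input format). The old sense keys are visited once,
    so the work grows linearly with the number of sense keys.
    """
    votes = defaultdict(Counter)
    for key, old_synset in old_keys.items():
        new_synset = new_keys.get(key)
        if new_synset is not None:  # the sense key exists in both versions
            votes[old_synset][new_synset] += 1
    # Assumption: if an old synset was split, keep the new synset that
    # received the most of its sense keys.
    return {old: counts.most_common(1)[0][0] for old, counts in votes.items()}


# Toy data: the sense keys follow the PWN format, the synset ids are invented.
old = {
    "dog%1:05:00::": "old-0001",
    "domestic_dog%1:05:00::": "old-0001",
    "cat%1:05:00::": "old-0002",
    "retired_word%1:05:00::": "old-0003",  # dropped in the new version
}
new = {
    "dog%1:05:00::": "new-0007",
    "domestic_dog%1:05:00::": "new-0007",
    "cat%1:05:00::": "new-0009",
}
mapping = map_synsets(old, new)
```

Old synsets whose sense keys all disappeared (here `old-0003`) are simply absent from the result, which leaves the caller free to decide how to handle unmapped entries.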
Linking the Sanskrit WordNet to the Vedic Dependency Treebank: a pilot study
Erica Biagetti | Chiara Zanchi | Silvia Luraghi
The Sanskrit WordNet is a resource currently under development, whose core was induced from a Vedic text sample semantically annotated by means of an ontology mapped on the Princeton WordNet synsets. Building on a previous case study on Ancient Greek (Zanchi et al. 2021), we show how sentence frames can be extracted from morphosyntactically parsed corpora by linking an existing dependency treebank of Vedic Sanskrit to verbal synsets in the Sanskrit WordNet. Our case study focuses on two verbs of asking, yāc- and prach-, featuring a high degree of variability in sentence frames. Treebanks enhanced with WordNet-based semantic information proved to be of crucial help in motivating sentence frame alternations.
StarNet: A WordNet Editor Interface
Oğuzhan Kuyrukçu | Ezgi Sanıyar | Olcay Taner Yildiz
In this paper, we introduce StarNet WordNet Editor, an open-source annotation tool designed for natural language processing. It is mainly used for creating and maintaining machine-readable dictionaries like WordNet (Miller, 1995) or domain-specific dictionaries. The editor provides a user-friendly interface and, since it is open-source, it is easy to use and develop. Besides the English WordNet and the Turkish WordNet (KeNet) (Bakay et al., 2020), it is also applicable to several languages and their domain-specific dictionaries.
Identifying FrameNet Lexical Semantic Structures for Knowledge Graph Extraction from Financial Customer Interactions
Cécile Robin | Atharva Kulkarni | Paul Buitelaar
We explore the use of the well established lexical resource and theory of the Berkeley FrameNet project to support the creation of a domain-specific knowledge graph in the financial domain, more precisely from financial customer interactions. We introduce a domain independent and unsupervised method that can be used across multiple applications, and test our experiments on the financial domain. We use an existing tool for term extraction and taxonomy generation in combination with information taken from FrameNet. By using principles from frame semantic theory, we show that we can connect domain-specific terms with their semantic concepts (semantic frames) and their properties (frame elements) to enrich knowledge about these terms, in order to improve the customer experience in customer-agent dialogue settings.
Some Considerations in the Construction of a Historical Language WordNet
Fahad Khan | John P. McCrae | Francisco Javier Minaya Gómez | Rafael Cruz González | Javier E. Díaz-Vera
This article describes the manual construction of a part of the Old English WordNet (Old-EWN) covering the semantic field of emotion terms. This manually constructed part of the wordnet is to be eventually integrated with the automatically generated/manually checked part covering the whole of the rest of the Old English lexicon (currently under construction). We present the workflow for the definition of these emotion synsets on the basis of a dataset produced by a specialist in this area. We also look at the enrichment of the original Global WordNet Association Lexical Markup Framework (GWA LMF) schema to include the extra information which this part of the OldEWN requires. In the final part of the article we discuss how the wordnet style of lexicon organisation can be used to share and disseminate research findings/datasets in lexical semantics.
Hidden in Plain Sight: Can German Wiktionary and Wordnets Facilitate the Detection of Antithesis?
Ramona Kuehn | Jelena Mitrović | Michael Granitzer
Existing wordnets mainly focus on synonyms, while antonyms have often been neglected, especially in wordnets in languages other than English. In this paper, we show how regular expressions are used to generate an antonym resource for German by using Wiktionary as a source. This resource contains antonyms for 45499 words. The antonyms can be used to extend existing wordnets. We show that this is important by comparing our antonym resource to the antonyms in OdeNet, the only freely available German wordnet that contains antonyms for 3059 words. We demonstrate that antonyms are relevant for the detection of the rhetorical figure antithesis. This figure has been known to influence the audience by creating contradiction and using a parallel sentence structure combined with antonyms. We first detect parallelism with part-of-speech tags and then apply our rule-based antithesis detection algorithm to a dataset of the messenger service Telegram. We evaluate our approach and achieve a precision of 57% and a recall of 45%, thus outperforming the existing approaches.
How do We Treat Systematic Polysemy in Wordnets and Similar Resources? – Using Human Intuition and Contextualized Embeddings as Guidance
Nathalie Sørensen | Sanni Nimb | Bolette Pedersen
Systematic polysemy is a well-known linguistic phenomenon where a group of lemmas follow the same polysemy pattern. However, when compiling a lexical resource like a wordnet, a problem arises regarding when to underspecify the two (or more) meanings by one (complex) sense and when to systematically split into separate senses. In this work, we present an extensive analysis of the systematic polysemy patterns in Danish, and in our preliminary study, we examine a subset of these with experiments on human intuition and contextual embeddings. The aim of this preparatory work is to enable future guidelines for each polysemy type. In the future, we hope to expand this approach and thereby hopefully obtain a sense inventory which is distributionally verified and thereby more suitable for NLP.
The Romanian Wordnet in Linked Open Data Format
Elena Irimia | Verginica Mititelu
In this paper we present the standardization of the Romanian Wordnet by means of conversion to the Linked Open Data format. We describe the vocabularies used to encode data and metadata of this resource. The decisions made are in accordance with the characteristics of the Romanian Wordnet, which are the outcome of the development method, enrichment strategies and resources used for its creation. Through interlinking with other resources, words in the Romanian Wordnet now have associated pronunciation, as well as syntagmatic information in the form of contexts of occurrence.
Combining WordNets with Treebanks to study idiomatic language: A pilot study on Rigvedic formulas through the lenses of the Sanskrit WordNet and the Vedic Treebank
Luca Brigada Villa | Erica Biagetti | Riccardo Ginevra | Chiara Zanchi
This paper shows how WordNets can be employed in tandem with morpho-syntactically annotated corpora to study poetic formulas. Pairing the lexico-semantic information of the Sanskrit WordNet with morpho-syntactic annotation from the Vedic Treebank, we perform a pilot study of formulas including SPEECH verbs in the RigVeda, the most ancient text of Sanskrit literature.
Word Sense Disambiguation Based on Iterative Activation Spreading with Contextual Embeddings for Sense Matching
Arkadiusz Janz | Maciej Piasecki
Many knowledge-based solutions have been proposed to solve the Word Sense Disambiguation (WSD) problem with limited annotated resources. Such WSD algorithms are able to cover very large sense repositories, but are still outperformed by supervised ones on benchmark data. In this paper, we start with an analysis identifying key properties and issues in the application of spreading activation algorithms in knowledge-based WSD, e.g. the influence of the network's local structures, interaction with context information and sense frequency. Taking our observations as a point of departure, we introduce a novel solution with new context-to-sense matching using BERT embeddings, an iterative parallel spreading activation function and selective sense alignment using contextual BERT embeddings. The proposed solution obtains performance beyond the state of the art for contemporary knowledge-based WSD approaches for both English and Polish data.
Documenting the Open Multilingual Wordnet
Francis Bond | Michael Wayne Goodman | Ewa Rudnicka | Luis Morgado da Costa | Alexandre Rademaker | John P. McCrae
In this project note we describe our work to improve the documentation for the Open Multilingual Wordnet (OMW), a platform integrating many open wordnets. This includes the documentation of the OMW website itself as well as of semantic relations used by the component wordnets. Some of this documentation work was done with the support of the Google Season of Docs. The OMW project page, which links both to the actual OMW server and to the documentation, has been moved to a new location: https://omwn.org.
Mapping GermaNet for the Semantic Web using OntoLex-Lemon
Claus Zinn | Marie Hinrichs | Erhard Hinrichs
GermaNet is a large lexical-semantic net that relates German nouns, verbs, and adjectives semantically. The word net has been manually constructed over the last 25 years and hence presents a high-quality, valuable resource for German. While GermaNet is maintained in a Postgres database, all its content can be exported as an XML-based serialisation. Recently, this XML representation has been converted into RDF, largely by staying close to GermaNet’s principle of arrangement where lexunits that share the same meaning are grouped together into so-called synsets. With each lexical unit and synset now globally addressable via a unique resource identifier, it has become much easier to link together GermaNet entries with other lexical and semantic resources. In terms of semantic interoperability, however, the RDF variant of GermaNet leaves much to be desired. In this paper, we describe yet another conversion from GermaNet’s XML representation to RDF. The new conversion makes use of the OntoLex-Lemon ontology, and therefore, presents a decisive step toward a GermaNet representation with a much higher level of semantic interoperability, and which makes it possible to use GermaNet with other wordnets that already support this conceptualisation of lexica.
Incorporating prepositions in the BulTreeBank WordNet
Zara Kancheva
A model for preposition incorporation in the BulTreeBank WordNet is presented which follows the model for presenting open class words in wordnets. An adapted semantic classification of prepositions is made on the basis of Bulgarian grammars, and the classes are used as synset categories. The good coverage of prepositions in the wordnet will be used for creating neural language models for Bulgarian. This extension of the wordnet improves its utility for semantic annotation.
Are there just WordNets or also SignNets?
Ineke Schuurman | Thierry Declerck | Caro Brosens | Margot Janssens | Vincent Vandeghinste | Bram Vanroy
For Sign Languages (SLs), can we create a SignNet, like a WordNet for spoken languages: a network of semantic relations between constitutive elements of SLs? We first discuss approaches that link SL data to wordnets, or integrate such elements with some adaptations into the structure of WordNet. Then, we present requirements for a SignNet, which is built on SL data and then linked to WordNet.
The Japanese Wordnet 2.0
Francis Bond | Takayuki Kuribayashi
This paper describes a new release of the Japanese wordnet. It uses the new global wordnet formats (McCrae et al., 2021) to incorporate a range of new information: orthographic variants (including hiragana, katakana and Latin representations) first described in Kuroda et al. (2011), classifiers, pronouns and exclamatives (Morgado da Costa and Bond, 2016) and many new senses, motivated both from corpus annotation and linking to the TUFs basic vocabulary (Bond et al., 2020). The wordnet has been moved to github and is available at https://bond-lab.github.io/wnja/.
Latvian WordNet
Peteris Paikens | Agute Klints | Ilze Lokmane | Lauma Pretkalniņa | Laura Rituma | Madara Stāde | Laine Strankale
This paper describes the recently developed Latvian WordNet and the main linguistic principles used in its development. The inventory of words and senses is based on the Te̅zaurs.lv online dictionary, restructuring the senses of the most frequently used words based on corpus evidence. The semantic linking methodology adapts Princeton WordNet principles to fit the Latvian language usage and existing linguistic tradition. The semantic links include hyponymy, meronymy, antonymy, similarity, conceptual connection and gradation. We also measure inter-annotator agreement for different types of semantic links. The dataset consists of 7609 words linked in 6515 synsets. 1266 of these words are considered fully completed as they have all the outgoing semantic links annotated, corpus examples assigned for each sense, as well as links to the English Princeton WordNet formed. The data is available to the public on Te̅zaurs.lv as an addition to the general dictionary data, and is also published as a downloadable dataset.
Initial Experiments for Building a Guarani WordNet
Luis Chiruzzo | Marvin Agüero-Torales | Aldo Alvarez | Yliana Rodríguez
This paper presents a work in progress about creating a Guarani version of the WordNet database. Guarani is an indigenous South American language and is a low-resource language from the NLP perspective. Following the expand approach, we aim to find Guarani lemmas that correspond to the concepts defined in WordNet. We do this through three strategies that try to select the correct lemmas from Guarani-Spanish datasets. We ran them through three different bilingual dictionaries and had native speakers assess the results. This procedure found Guarani lemmas for about 6.5 thousand synsets, including 27% of the base WordNet concepts. However, more work on the quality of the selected words will be needed in order to create a final version of the dataset.
A CCGbank for Turkish: From Dependency to CCG
Aslı Kuzgun | Oğuz Kerem Yıldız | Olcay Taner Yildiz
In this paper, we present the building of a CCGbank for Turkish by using standardised dependency corpora. We automatically induce Combinatory Categorial Grammar (CCG) categories for each word token in the Turkish dependency corpora. The CCG induction algorithm we present here is based on the dependency relations that are defined in the latest release of the Universal Dependencies (UD) framework. We aim for an algorithm that can easily be used in all the Turkish treebanks that are annotated in this framework. Therefore, we employ a lexicalist approach in order to make full use of the dependency relations while creating a semantically transparent corpus. We present the treebanks we employed in this study as well as their annotation framework. We introduce the structure of the algorithm we used along with the specific issues that are different from previous studies. Lastly, we show how the results change with this lexical approach in CCGbank for Turkish compared to the previous CCGbank studies in Turkish.
Reusing the Danish WordNet for a New Central Word Register for Danish - a Project Report
Bolette Pedersen | Sanni Nimb | Nathalie Sørensen | Sussi Olsen | Ida Flörke | Thomas Troelsgård
In this paper we report on a new Danish lexical initiative, the Central Word Register for Danish, (COR), which aims at providing an open-source, well curated and large-coverage lexicon for AI purposes. The semantic part of the lexicon (COR-S) relies to a large extent on the lexical-semantic information provided in the Danish wordnet, DanNet. However, we have taken the opportunity to evaluate and curate the wordnet information while compiling the new resource. Some information types have been simplified and more systematically curated. This is the case for the hyponymy relations, the ontological typing, and the sense inventory, i.e. the treatment of polysemy, including systematic polysemy.
Recent Developments in BTB-WordNet
Kiril Simov | Petya Osenova
The paper reports on recent developments in Bulgarian BTB-WordNet (BTB-WN). This resource is viewed as playing a central role with respect to the integration and interlinking of various language resources such as: e-dictionaries (morphological, terminological, bilingual, orthographic, etymological and explanatory, etc., including editions from previous periods); corpora (coming from outside or being internal - like the corpus of definitions as well as the corpus of examples to synset meanings); ontologies (such as CIDOC-CRM, DBpedia, etc.); sources of world knowledge (such as information from the Bulgarian Encyclopedia, Wikipedia, etc.). The paper also gives information about a number of applications built on BTB-WN. These are: the Bulgaria-centered knowledge graph, the All about word application as well as some education-oriented exercises.
Lexicalised and non-lexicalized multi-word expressions in WordNet: a cross-encoder approach
Marek Maziarz | Łukasz Grabowski | Tadeusz Piotrowski | Ewa Rudnicka | Maciej Piasecki
Focusing on the recognition of multi-word expressions (MWEs), we address the problem of recording MWEs in WordNet. In fact, not all MWEs recorded in that lexical database can be considered lexicalised beyond doubt (e.g. elements of wordnet taxonomy, quantifier phrases, certain collocations). In this paper, we use a cross-encoder approach to improve our earlier method of distinguishing between lexicalised and non-lexicalised MWEs found in WordNet using custom-designed rule-based and statistical approaches. We achieve an F1-measure close to 80% for the class of lexicalised word combinations, easily beating two baselines (a random one and a majority class one). The language model also proves to be better than a feature-based logistic regression model.
Towards an RDF Representation of the Infrastructure consisting in using Wordnets as a conceptual Interlingua between multilingual Sign Language Datasets
Thierry Declerck | Thomas Troelsgård | Sussi Olsen
We present ongoing work dealing with a Linked Data compliant representation of infrastructures using wordnets for connecting multilingual Sign Language data sets. We build for this on already existing RDF and OntoLex representations of Open Multilingual Wordnet (OMW) data sets and work done by the European EASIER research project on the use of the CSV files of OMW for linking glosses and basic semantic information associated with Sign Language data sets in two languages: German and Greek. In this context, we started the transformation into RDF of a Danish data set, which links Danish Sign Language data and the wordnet for Danish, DanNet. The final objective of our work is to include Sign Language data sets (and their conceptual cross-linking via wordnets) in the Linguistic Linked Open Data cloud.
Semantic Parsing and Sense Tagging the Princeton WordNet Gloss Corpus
Alexandre Rademaker | Abhishek Basu | Rajkiran Veluri
In 2008, the Princeton team released the last version of the “Princeton Annotated Gloss Corpus”. In this corpus, the word forms from the definitions and examples (glosses) of Princeton WordNet are manually linked to the context-appropriate sense in WordNet. However, the annotation was not complete, and the dataset was never officially released as part of WordNet 3.0, remaining as one of the standoff files available for download. Eleven years later, in 2019, one of the authors of this paper restarted the project aiming to complete the sense annotation of the approximately 200 thousand word forms not yet annotated. Here, we provide additional motivations to complete this dataset and report the progress in the work and evaluations. Intending to provide an extra level of consistency in the sense annotation and a deep semantic representation of the definitions and examples, promoting WordNet from a lexical resource to a lightweight ontology, we now employ the English Resource Grammar (ERG), a broad-coverage HPSG grammar of English, to parse the sentences and project the sense annotations from the surface words to the ERG predicates. We also report some initial steps on upgrading the corpus to WordNet 3.1 to facilitate mapping the data to other lexical resources.
Context-Gloss Augmentation for Improving Arabic Target Sense Verification
Sanad Malaysha | Mustafa Jarrar | Mohammed Khalilia
The Arabic language lacks semantic datasets and sense inventories. The most common semantically-labeled dataset for Arabic is ArabGlossBERT, a relatively small dataset that consists of 167K context-gloss pairs (about 60K positive and 107K negative pairs), collected from Arabic dictionaries. This paper presents an enrichment of the ArabGlossBERT dataset, augmenting it using (Arabic-English-Arabic) machine back-translation. Augmentation increased the dataset size to 352K pairs (149K positive and 203K negative pairs). We measure the impact of augmentation using different data configurations to fine-tune BERT on the target sense verification (TSV) task. Overall, the accuracy ranges between 78% and 84% for different data configurations. Although our approach performed on par with the baseline, we did observe some improvements for some POS tags in some experiments. Furthermore, our fine-tuned models are trained on a larger dataset covering a larger vocabulary and more contexts. We provide an in-depth analysis of the accuracy for each part-of-speech (POS).
The Open Cantonese Sense-Tagged Corpus
Joanna Sio | Luis Morgado Da Costa
This paper introduces the Open Cantonese Sense-Tagged Corpus, a new and ongoing project to serve as the companion to the development of the Cantonese Wordnet. This corpus is built on top of the Cantonese Wordnet Corpus, which currently provides example sentences for most verbs in this wordnet. This paper motivates the choice of starting a sense-tagged corpus from both linguistic and educational perspectives, and discusses the current solutions to issues arising from the sense-tagging exercise. In total, we have tagged over 5,000 concepts, with more than 3,700 direct links to the Cantonese Wordnet.
Correcting Sense Annotations Using Wordnets and Translations
Arnob Mallik | Grzegorz Kondrak
Acquiring large amounts of high-quality annotated data is an open issue in word sense disambiguation. This problem has become more critical recently with the advent of supervised models based on neural networks, which require large amounts of annotated data. We propose two algorithms for making selective corrections on a sense-annotated parallel corpus, based on cross-lingual synset mappings. We show that, when applied to bilingual parallel corpora, these algorithms can rectify noisy sense annotations, and thereby produce multilingual sense-annotated data of improved quality.
A Benchmark and Scoring Algorithm for Enriching Arabic Synonyms
Sana Ghanem
|
Mustafa Jarrar
|
Radi Jarrar
|
Ibrahim Bounhas
This paper addresses the task of extending a given synset with additional synonyms, taking into account synonymy strength as a fuzzy value. Given a mono/multilingual synset and a threshold (a fuzzy value [0−1]), our goal is to extract new synonyms above this threshold from existing lexicons. Our contribution is twofold: an algorithm and a benchmark dataset. The dataset consists of 3K candidate synonyms for 500 synsets. Each candidate synonym is annotated with a fuzzy value by four linguists. The dataset is important for (i) understanding how much linguists (dis/)agree on synonymy, in addition to (ii) using the dataset as a baseline to evaluate our algorithm. Our proposed algorithm extracts synonyms from existing lexicons and computes a fuzzy value for each candidate. Our evaluations show that the algorithm behaves like a linguist and its fuzzy values are close to those proposed by linguists (using RMSE and MAE). The dataset and a demo page are publicly available at https://portal.sina.birzeit.edu/synonyms.
pdf
abs
Expanding the Conceptual Description of Verbs in WordNet with Semantic and Syntactic Information
Ivelina Stoyanova
|
Svetlozara Leseva
This paper describes an ongoing effort towards expanding the semantic and conceptual description of verbs in WordNet by combining information from two other resources, FrameNet and VerbNet, as well as enriching the verbs’ description with syntactic patterns extracted from the three resources. The conceptual description of verb synsets is provided by assigning a FrameNet frame which provides the relevant set of frame elements denoting the predicate’s participants and props. This information is supplemented by assigning a VerbNet class and the set of semantic roles associated with it. The information extracted from FrameNet and VerbNet and assigned to a synset is aligned (semi-automatically with subsequent manual corrections) at the following levels: (i) FrameNet frame: VerbNet class; (ii) FrameNet frame elements: VerbNet semantic roles; (iii) FrameNet semantic types and restrictions: VerbNet selectional restrictions. We then link the syntactic patterns associated with the units in FrameNet, VerbNet and WordNet, by unifying their representation and by matching the corresponding patterns at the level of syntactic groups. The alignment of the semantic components and their syntactic realisations is essential for the better exploitation of the abundance of information across resources, including shedding light on cross-resource similarities, discrepancies and inconsistencies. The syntactic patterns can facilitate the extraction of examples illustrating the use of verb synset literals in corpora and their semantic characterisation through the association of the syntactic groups with the components of semantic description (frame elements or semantic roles) and can be employed in various tasks requiring semantic and syntactic description. The resource is publicly available to the community. The components of the conceptual description are visualised showing the links to the original resources each component is drawn from.
pdf
abs
An Experiment: Finding Parents for Parentless Synsets by Means of CILI
Ahti Lohk
|
Martin Rebane
|
Heili Orav
Identifying and correcting inconsistencies in wordnets is a natural part of their development. Focusing only on the subproblem of missing links, we aim to automatically find possible parents for parentless synsets in the IS-A hierarchies of a target wordnet by means of source wordnets, where the target and source wordnets are in XML format and equipped with the Collaborative Interlingual Index (CILI). In this paper, we describe the algorithm and provide statistics on the possible parents of parentless synsets of the wordnets included in the study. Additionally, we investigate the suitability of the proposed potential parent synsets for correcting noun and verb synsets within the Estonian wordnet.
pdf
abs
Extending the usage of adjectives in the Zulu AfWN
Laurette Marais
|
Laurette Pretorius
The African languages Wordnet (AfWN) for Zulu (ZWN) was built using the expand approach, which relies on the translation of concepts in the Princeton WordNet (PWN), while retaining their PWN lexical categories. In this paper the focus is on the adjective as a PWN lexical category. What is considered adjectival information (provided both attributively and predicatively) in English is usually verbalised quite differently in Zulu (often as verb or copulative constructions), as may be seen by inspecting the Zulu written forms in “adjective” entries in ZWN. These written forms are not complete Zulu verb or copulative constructions, and in order for them to be useful, tense, polarity and agreement have to be added. This paper presents a grammar-based approach to recover important morphosyntactic information implicit in the ZWN “adjective” written forms in order to derive a tool that would assist a user of the ZWN to render and analyse correct full forms automatically as desired by the context in which an “adjective” is used.
pdf
abs
Linking SIL Semantic Domains to Wordnet and Expanding the Abui Wordnet through Rapid Word Collection Methodology
Luis Morgado Da Costa
|
František Kratochvíl
|
George Saad
|
Benidiktus Delpada
|
Daniel Simon Lanma
|
Francis Bond
|
Natálie Wolfová
|
A.l. Blake
In this paper we describe a new methodology to expand the Abui Wordnet through data collected using the Rapid Word Collection (RWC) method – based on SIL’s Semantic Domains. Using a multilingual sense-intersection algorithm, we created a ranked list of concept suggestions for each domain, and then used the ranked list as a filter to link the Abui RWC data to wordnet. This used translations from both SIL’s Semantic Domain’s structure and example words, both available through SIL’s Fieldworks software and the RWC project. We release both the new mapping of the SIL Semantic Domains to wordnet and an expansion of the Abui Wordnet.
pdf
abs
Wordnet-oriented recognition of derivational relations
Wiktor Walentynowicz
|
Maciej Piasecki
Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and predict senses of derivates (derived words). In this work, we analyse different methods of representing derivational forms obtained from WordNet – from quantitative vectors to contextual learned embedding methods – and compare ways of classifying the derivational relations occurring between them. Our research focuses on the explainability of the obtained representations and results. The data source for our research is plWordNet, which is the wordnet of the Polish language and includes a rich set of derivation examples.
pdf
abs
What do Language Models know about word senses? Zero-Shot WSD with Language Models and Domain Inventories
Oscar Sainz
|
Oier Lopez de Lacalle
|
Eneko Agirre
|
German Rigau
Language Models are at the core of almost any Natural Language Processing system nowadays. One of their particularities is their contextualized representations, a game-changing feature when disambiguation between word senses is necessary. In this paper we aim to explore to what extent language models are capable of discerning among senses at inference time. We performed this analysis by prompting commonly used Language Models such as BERT or RoBERTa to perform the task of Word Sense Disambiguation (WSD). We leverage the relation between word senses and domains, and cast WSD as a textual entailment problem, where the different hypotheses refer to the domains of the word senses. Our results show that this approach is indeed effective, close to supervised systems.
pdf
abs
Resolving Multiple Hyperonymy
Svetla Koeva
|
Dimitar Hristov
WordNet contains a fair number of synsets with multiple hyperonyms. In parent–child relations, a child can have only one parent (ancestor). Consequently, multiple hyperonymy represents distinct semantic relations. In order to reclassify the multiple hyperonyms, we define a small set of new semantic relations (such as function, origin and form) that cover the various instances of multiple hyperonyms. The synsets with multiple hyperonyms that lead to the same root and belong to the same semantic class were grouped automatically, resulting in semantic patterns that serve as a point of departure for the classification. The proposed changes are based on semantic analysis and may involve the redefinition of one or several multiple hyperonymy relations to new ones, the removal of one or several multiple hyperonymy relations, and rarely the addition of a new hyperonymy relation. As a result, we incorporate the newly defined semantic relations that resolve the former multiple hyperonymy relations and propose an updated WordNet structure without multiple hyperonyms. The resulting WordNet structure without multiple hyperonyms may be used for a variety of purposes that require proper inheritance.
pdf
abs
Towards the integration of WordNet into ClinIDMap
Elena Zotova
|
Montse Cuadros
|
German Rigau
This paper presents the integration of the WordNet knowledge resource into the ClinIDMap tool, which aims to map identifiers between clinical ontologies and lexical resources. ClinIDMap interlinks identifiers from UMLS, SNOMED-CT, ICD-10 and the corresponding Wikidata and Wikipedia articles for concepts from the UMLS Metathesaurus. The main goal of the tool is to provide semantic interoperability across the clinical concepts from various knowledge bases. As a side effect, the mapping enriches already annotated medical corpora in multiple languages with new labels. In this new release, we add WordNet 3.0 and 3.1 synsets using the available mappings through Wikidata. Thanks to cross-lingual links in MCR, we also include the corresponding synsets in other languages and further extend ClinIDMap with different domain information. Finally, the resulting resource helps in the task of enriching already annotated clinical corpora with additional semantic annotations.
pdf
abs
Connecting Multilingual Wordnets: Strategies for Improving ILI Classification in OdeNet
Melanie Siegel
|
Johann Bergh
The Open Multilingual Wordnet (OMW) is an open source project that was launched with the goal of making it easy to use wordnets in multiple languages without having to pay expensive proprietary licensing costs. As OMW evolved, the interlingual indicator (ILI) was used to allow semantically equivalent synsets in different languages to be linked to each other. OdeNet is the German language wordnet which forms part of the OMW project. This paper analyses the shortcomings of the initial ILI classification in OdeNet and the consequent methods used to improve this classification.