Proceedings of the 2020 Globalex Workshop on Linked Lexicography
The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.
This paper reports on an extended version of a synonym verb class lexicon, newly called SynSemClass (formerly CzEngClass). This lexicon stores cross-lingual semantically similar verb senses in synonym classes extracted from a richly annotated parallel corpus, the Prague Czech-English Dependency Treebank. When building the lexicon, we make use of predicate-argument relations (valency) and link them to semantic roles; in addition, each entry is linked to several external lexicons of more or less “semantic” nature, namely FrameNet, WordNet, VerbNet, OntoNotes and PropBank, and Czech VALLEX. The aim is to provide a linguistic resource that can be used to compare semantic roles and their syntactic properties and features across languages within and across synonym groups (classes, or ’synsets’), as well as gold standard data for automatic NLP experiments with such synonyms, such as synonym discovery, feature mapping, etc. However, perhaps the most important goal is to eventually build an event type ontology that can be referenced and used as a human-readable and human-understandable “database” for all types of events, processes and states. While the current paper describes primarily the content of the lexicon, we are also presenting a preliminary design of a format compatible with Linked Data, on which we are hoping to get feedback during discussions at the workshop. Once the resource (in whichever form) is applied to corpus annotation, deep analysis will be possible using such combined resources as training data.
In this paper we describe the process of inclusion of etymological information in a knowledge base of interoperable Latin linguistic resources developed in the context of the LiLa: Linking Latin project. Interoperability is obtained by applying the Linked Open Data principles. Particularly, an extensive collection of Latin lemmas is used to link the (distributed) resources. For the etymology, we rely on the Ontolex-lemon ontology and the lemonEty extension to model the information, while the source data are taken from a recent etymological dictionary of Latin. As a result, the collection of lemmas LiLa is built around now includes 1,465 Proto-Italic and 1,393 Proto-Indo-European reconstructed forms that are used to explain the history of 1,400 Latin words. We discuss the motivation, methodology and modeling strategies of the work, as well as its possible applications and potential future developments.
We present the ongoing work on an automatically generated dictionary describing Danish in the 16th century. A series of relevant dictionaries – from the period as well as more recent ones – are linked together at lemma level, and where possible, definitions or keywords are extracted and presented in the new dictionary.
This extended abstract presents ongoing work on interlinking and merging the Open Dutch WordNet and generic lexicographic resources for Dutch, focusing for now on the Dutch and English versions of Wiktionary and using the Algemeen Nederlands Woordenboek as a quality checking instance. As the Open Dutch WordNet is already equipped with a relevant number of complex lexical units, we aim to expand it and to propose a new representational framework for encoding the interlinked and integrated data. The longer-term goal of the work is to investigate whether and how senses can be restricted to particular morphological variations of Dutch lexical entries, and how to represent this information in a Linguistic Linked Open Data compliant format.
There are wordnets in many languages, many of them aligned with Princeton WordNet, some through a (semi-)automatic process, but we rarely see actual discussions of the role of false friends in this process. Bearing in mind known issues related to such words in translation, and further motivated by false-friend-related issues in the alignment of a Portuguese wordnet with Princeton WordNet, we aim to widen this discussion, while suggesting preliminary ideas on how wordnets could benefit from this kind of research.
This paper describes the development and current state of Pinchah Kristang – an online dictionary for Kristang. Kristang is a critically endangered language of the Portuguese-Eurasian communities residing mainly in Malacca and Singapore. Pinchah Kristang has been a central tool to the revitalization efforts of Kristang in Singapore, and collates information from multiple sources, including existing dictionaries and wordlists, ongoing language documentation work, and new words that emerge regularly from relexification efforts by the community. This online dictionary is powered by the Princeton Wordnet and the Open Kristang Wordnet – a choice that brings both advantages and disadvantages. This paper will introduce the current version of this dictionary, motivate some of its design choices, and discuss possible future directions.
Our aim is to identify suitable sense representations for NLP in Danish. We investigate sense inventories that correlate with human interpretations of word meaning and ambiguity as typically described in dictionaries and wordnets and that are well reflected distributionally as expressed in word embeddings. To this end, we study a number of highly ambiguous Danish nouns and examine the effectiveness of sense representations constructed by combining vectors from a distributional model with the information from a wordnet. We establish representations based on centroids obtained from wordnet synsets and example sentences as well as representations established via are tested in a word sense disambiguation task. We conclude that the more information extracted from the wordnet entries (example sentence, definition, semantic relations) the more successful the sense representation vector.
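The centroid idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are invented three-dimensional toy values, the sense inventory (`SENSES`) is hypothetical and uses English words for readability, and choosing the sense by cosine similarity to the context centroid is an assumption about how such vectors might be used in disambiguation.

```python
import math

def centroid(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    if not vectors:
        raise ValueError("cannot build a centroid from zero vectors")
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy embeddings: invented values, for illustration only.
EMBEDDINGS = {
    "money": [0.9, 0.1, 0.0],
    "loan": [0.8, 0.2, 0.1],
    "river": [0.0, 0.9, 0.2],
    "water": [0.1, 0.8, 0.3],
    "deposit": [0.7, 0.2, 0.0],
    "shore": [0.1, 0.9, 0.1],
}

# Hypothetical sense inventory: sense id -> words taken from the synset
# and its example sentence in a wordnet entry.
SENSES = {
    "bank_finance": ["money", "loan"],
    "bank_river": ["river", "water"],
}

def sense_vectors(senses, embeddings):
    """Build one centroid vector per sense from the words attached to it."""
    return {
        sense_id: centroid([embeddings[w] for w in words if w in embeddings])
        for sense_id, words in senses.items()
    }

def disambiguate(context_words, sense_vecs, embeddings):
    """Return the sense whose centroid is closest to the context centroid."""
    context = centroid([embeddings[w] for w in context_words if w in embeddings])
    return max(sense_vecs, key=lambda sense_id: cosine(context, sense_vecs[sense_id]))

SENSE_VECS = sense_vectors(SENSES, EMBEDDINGS)
```

Calling `disambiguate(["deposit"], SENSE_VECS, EMBEDDINGS)` selects the financial sense in this toy setup; adding more words per sense (definitions, related synsets) only changes the contents of `SENSES`, not the procedure.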
Bring’s thesaurus (Bring) is a Swedish counterpart of Roget, and its digitized version could make a valuable language resource for use in many and diverse natural language processing (NLP) applications. From the literature we know that Roget-style thesauruses and wordnets have complementary strengths in this context, so both kinds of lexical-semantic resource are good to have. However, Bring was published in 1930, and its lexical items are in the form of lemma–POS pairings. In order to be useful in our NLP systems, polysemous lexical items need to be disambiguated, and a large amount of modern vocabulary must be added in the proper places in Bring. The work presented here describes experiments aiming at automating these two tasks, at least in part, where we use the structure of an existing Swedish semantic lexicon – Saldo – both for disambiguation of ambiguous Bring entries and for addition of new entries to Bring.
An Excellency Research Project called “Terminology of olive oil and trade: China and other international markets” (P07-HUM-03041) was initiated under my management in 2008, financed by the Andalusian regional government, the Junta de Andalucía. The project, known as “OLIVATERM”, had two main objectives: on the one hand, to develop the first systematic multilingual terminological dictionary in the scientific and socio-economic area of the olive grove and olive oils in order to facilitate communication on the topic; on the other, to contribute to the expansion of Andalusia’s domestic and international trade and the dissemination of its culture. The main outcome of the research was the Diccionario de términos del aceite de oliva (DTAO – Dictionary of olive oil terms) (Roldán Vendrell, Arco Libros: 2013). This dictionary is currently the main reference source for answering queries and resolving any doubts that might arise in the use of this terminology in the three reference languages (Spanish, English and Chinese). It has received unanimous acknowledgement from numerous specialists in the sphere of Terminology, including most especially Maria Teresa Cabré (UPF), Miguel Casas Gómez (UCA, Ibérica 27 (2014): 217-234), François Maniez (Université de Lyon), Maria Isabel Santamaría Pérez and Chelo Vargas Sierra (UA), Pamela Faber (UGR), Joaquín García Palacios (USAL), and Marie-Claude L’Homme (Université de Montréal). The DTAO is well known in the academic area of Terminology, but has not reached many of the institutions and organizations (domestic and international), translators, journalists, communicators and olive oil sector professionals that could benefit from it in their professions, especially salespeople, who need (fortunately, with ever greater frequency) information on terminology in the book’s target languages for their commercial transactions.
That is why we are currently working on a multichannel technological solution that enables a greater and more efficient transfer to the business sector: the design and development of an adaptive website (responsive web design) that provides access to the information in any usage context. We believe that access must be afforded to this valuable reference information on a hand-held device that enables it to be looked up both on- and offline and so pre-empts situations in which it is impossible to connect to the internet. The web application’s database will therefore also feed a series of mobile applications that will be available for the main platforms (iOS, Android). This tool will represent real progress in the dynamic transfer of specialized knowledge in the field of olive growing and olive oil production. Apart from delivering universal and free access to this information, the web application will welcome user suggestions for including new terms, new information and new reference languages, making it a collaborative tool that is also fed by its own users. With this tool we hope to respond to society’s needs for multilingual communication in the area of olive oil and to help give a boost to economic activity in the olive sector. In this work, in parallel to the presentation of the adaptive website, we will present a lexical repertoire comprising new terms and expressions coined in this field (in the three working languages) in recent years. These neologisms reflect the most relevant innovations that have occurred in the olive oil sector over the last decade and, therefore, they must be compiled, sorted, systematized, and made accessible to the users in the web application we intend to develop.
Thanks to new technologies, the elaboration of specialized bilingual dictionaries can be made faster and more standardized, offering not only a dictionary of equivalents, but also the representation of a conceptual field. Nevertheless, in view of these new tools and services, some of which are offered free of charge by European institutions, it is necessary to question the viability of their use by an average user and the previous knowledge required for such use, as well as the possible problems such a user may encounter. In this contribution we show a series of possible difficulties, as well as a methodological proposal and some solutions, by presenting an extract of a French-Spanish bilingual dictionary for the domain of architecture. The extract in question is a sample of about 30 terms created with the Lexonomy dictionary editor (Měchura 2017).
This paper describes RACAI’s word sense alignment system, which participated in the Monolingual Word Sense Alignment shared task organized at the GlobaLex 2020 workshop. We discuss the system architecture and some of the challenges that we faced, and present our results on several of the languages available for the task.
In this paper we describe the system submitted to the ELEXIS Monolingual Word Sense Alignment Task. We test different systems, namely two types of LSTMs and a system based on a pretrained Bidirectional Encoder Representations from Transformers (BERT) model, to solve the task. The LSTM models use fastText pre-trained word vector features with different settings. For training the models, we did not combine external data with the dataset provided for the task. We select a subset of the proposed languages, namely a set of Romance languages (Italian, Spanish, Portuguese) together with English and Dutch. The Siamese LSTM with attention and PoS tagging (LSTM-A) performed better than the other two systems, achieving a 5-Class Accuracy score of 0.844 in the Overall Results and ranking first among five teams.
This paper describes our system for monolingual sense alignment across dictionaries. The task of monolingual word sense alignment is presented as a task of predicting the relationship between two senses. We will present two solutions, one based on supervised machine learning, and the other based on a pre-trained neural network language model, specifically BERT. Our models perform competitively for binary classification, reporting high scores for almost all languages. This paper presents our submission for the shared task on monolingual word sense alignment across dictionaries as part of the GLOBALEX 2020 – Linked Lexicography workshop at the 12th Language Resources and Evaluation Conference (LREC). Monolingual word sense alignment (MWSA) is the task of aligning word senses across resources in the same language. Lexical-semantic resources (LSR) such as dictionaries form a valuable foundation of numerous natural language processing (NLP) tasks. Since they are created manually by experts, dictionaries can be considered among the resources of highest quality and importance. However, the existing LSRs in machine readable form are small in scope or missing altogether. Thus, it would be extremely beneficial if the existing lexical resources could be connected and expanded. Lexical resources display considerable variation in the number of word senses that lexicographers assign to a given entry in a dictionary. This is because the identification and differentiation of word senses is one of the harder tasks that lexicographers face. Hence, the task of combining dictionaries from different sources is difficult, especially for the case of mapping the senses of entries, which often differ significantly in granularity and coverage (Ahmadi et al., 2020).
There are three different angles from which the problem of word sense alignment can be addressed: approaches based on the similarity of textual descriptions of word senses, approaches based on structural properties of lexical-semantic resources, and a combination of both (Matuschek, 2014). In this paper we focus on the similarity of textual descriptions. This is a common approach as the majority of previous work used some notion of similarity between senses, mostly gloss overlap or semantic relatedness based on glosses. This makes sense, as glosses are a prerequisite for humans to recognize the meaning of an encoded sense, and thus also an intuitive way of judging the similarity of senses (Matuschek, 2014). The paper is structured as follows: we provide a brief overview of related work in Section 2, and a description of the corpus in Section 3. In Section 4 we explain all important aspects of our model implementation, while the results are presented in Section 5. Finally, we end the paper with the discussion in Section 6 and conclusion in Section 7.
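Gloss overlap, mentioned above as the most common notion of sense similarity in previous work, can be sketched in a few lines. This is only an illustration and not the submitted system (which uses supervised learning and BERT): the glosses, the sense identifiers, the Jaccard formulation and the `threshold` value are all assumptions chosen for the example, and the sketch makes a binary aligned/not-aligned decision only.

```python
import re

def tokens(gloss):
    """Lower-case a gloss and split it into a set of word tokens."""
    return set(re.findall(r"\w+", gloss.lower()))

def gloss_overlap(gloss_a, gloss_b):
    """Jaccard overlap between the token sets of two glosses."""
    a, b = tokens(gloss_a), tokens(gloss_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def align(senses_a, senses_b, threshold=0.3):
    """Pair every sense of one dictionary with every sense of the other and
    keep the pairs whose gloss overlap reaches the (assumed) threshold."""
    pairs = []
    for id_a, gloss_a in senses_a.items():
        for id_b, gloss_b in senses_b.items():
            score = gloss_overlap(gloss_a, gloss_b)
            if score >= threshold:
                pairs.append((id_a, id_b, score))
    return pairs

# Invented glosses for the same headword in two hypothetical dictionaries.
DICT_A = {
    "a1": "a financial institution that accepts deposits",
    "a2": "sloping land beside a river",
}
DICT_B = {
    "b1": "an institution that accepts deposits and lends money",
    "b2": "the land alongside a river or lake",
}

ALIGNED = align(DICT_A, DICT_B)
```

In this toy data the two financial senses and the two riverside senses are paired, while the cross pairs fall below the threshold. A raw token overlap like this is sensitive to function words and to differences in gloss length, which is one reason learned similarity models are attractive for the task.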
In this paper, we present the NUIG system at the TIAD shared task. This system includes graph-based metrics calculated using novel algorithms, with an unsupervised document embedding tool called ONETA and an unsupervised multi-way neural machine translation method. The results are an improvement over our previous system and produce the highest precision among all systems in the task as well as very competitive F-Measure results. Incorporating features from other systems should be easy in the framework we describe in this paper, suggesting this could very easily be extended to an even stronger result.
This paper describes our contribution to the Third Shared Task on Translation Inference across Dictionaries (TIAD-2020). We describe an approach to translation inference based on symbolic methods, the propagation of concepts over a graph of interconnected dictionaries: Given a mapping from source language words to lexical concepts (e.g., synsets) as a seed, we use bilingual dictionaries to extrapolate a mapping of pivot and target language words to these lexical concepts. Translation inference is then performed by looking up the lexical concept(s) of a source language word and returning the target language word(s) for which these lexical concepts have the respective highest score. We present two instantiations of this system: One using WordNet synsets as concepts, and one using lexical entries (translations) as concepts. With a threshold of 0, the latter configuration ranks second among participating systems in terms of F1 score. We also describe additional evaluation experiments on Apertium data, a comparison with an earlier approach based on embedding projection, and an approach for constrained projection that outperforms the TIAD-2020 vanilla system by a large margin.
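The propagation scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' system: the function names `propagate` and `infer`, the averaging of scores over contributing dictionary entries, the interpretation of the threshold as a strict lower bound, and the toy English-Spanish-French data are all hypothetical.

```python
from collections import defaultdict

def propagate(concepts_by_word, dictionary):
    """Carry concept scores across one bilingual dictionary.

    concepts_by_word: word -> {concept: score} on the known side.
    dictionary: iterable of (known_side_word, translation) pairs.
    Scores reaching a translation are averaged over the entries that
    contributed to it (an assumption made for this sketch).
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for word, translation in dictionary:
        if word in concepts_by_word:
            counts[translation] += 1
            for concept, score in concepts_by_word[word].items():
                totals[translation][concept] += score
    return {
        translation: {c: s / counts[translation] for c, s in concepts.items()}
        for translation, concepts in totals.items()
    }

def infer(source_word, source_concepts, target_concepts, threshold=0.0):
    """Return target words sharing a concept with the source word,
    best score first; words scoring at or below the threshold are dropped."""
    wanted = source_concepts.get(source_word, {})
    scored = {}
    for target, concepts in target_concepts.items():
        score = max((concepts.get(c, 0.0) for c in wanted), default=0.0)
        if score > threshold:
            scored[target] = score
    return sorted(scored, key=lambda t: (-scored[t], t))

# Toy seed and dictionaries (invented): English -> Spanish pivot -> French.
SEED = {"bank": {"FINANCE": 1.0}, "shore": {"RIVERSIDE": 1.0}}
EN_ES = [("bank", "banco"), ("shore", "orilla")]
ES_FR = [("banco", "banque"), ("orilla", "rive")]

PIVOT = propagate(SEED, EN_ES)
TARGET = propagate(PIVOT, ES_FR)
```

With this data, `infer("bank", SEED, TARGET)` returns the French candidate that inherited the same concept through the Spanish pivot. Replacing the concept labels with lexical entries of one language would correspond loosely to the second instantiation mentioned in the abstract.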
This paper describes the participation of two different approaches in the 3rd Translation Inference Across Dictionaries (TIAD 2020) shared task. The aim of the task is to automatically generate new bilingual dictionaries from existing ones. To that end, we essayed two different types of techniques: based on graph exploration on the one hand and, on the other hand, based on cross-lingual word embeddings. The task evaluation results show that graph exploration is very effective, accomplishing relatively high precision and recall values in comparison with the other participating systems, while cross-lingual embeddings reach high precision but lower recall.
This paper describes four different strategies proposed to the TIAD 2020 Shared Task for automatic translation inference across dictionaries. The proposed strategies are based on the analysis of the Apertium RDF graph, taking advantage of characteristics such as translation using multiple paths, synonyms and similarities between lexical entries from different lexicons and cardinality of possible translations through the graph. The four strategies were trained and validated on the Apertium RDF EN<->ES dictionary, showing promising results. Finally, the strategies, applied together, obtained an F-measure of 0.43 in the task of inferring the dictionaries proposed in the shared task, thus ranking third with respect to the other new systems presented to the TIAD 2020 Shared Task. No system presented to the shared task exceeded the baseline proposed by the TIAD organizers.
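One of the graph characteristics named above, translation through multiple paths, can be illustrated with a small sketch. It is not one of the four submitted strategies: the function names, the language-tagged pivot keys and the word pairs are invented, and ranking candidates simply by the number of distinct pivot paths is an assumption made for the example.

```python
from collections import Counter, defaultdict

def build_index(pairs):
    """Turn (word, translation) pairs into a word -> set of translations index."""
    index = defaultdict(set)
    for word, translation in pairs:
        index[word].add(translation)
    return index

def infer_via_pivots(source_word, source_to_pivot, pivot_to_target):
    """Count, for each target candidate, how many distinct pivot words
    connect it to the source word."""
    paths = Counter()
    for pivot in source_to_pivot.get(source_word, ()):
        for target in pivot_to_target.get(pivot, ()):
            paths[target] += 1
    return paths

# Invented toy graph: English source, Spanish and French pivots, Portuguese target.
SOURCE_TO_PIVOT = build_index([
    ("bank", "es:banco"),
    ("bank", "fr:banque"),
])
PIVOT_TO_TARGET = build_index([
    ("es:banco", "banco"),
    ("fr:banque", "banco"),
    ("es:banco", "assento"),
])

CANDIDATES = infer_via_pivots("bank", SOURCE_TO_PIVOT, PIVOT_TO_TARGET)
```

Here the candidate reached through both pivots receives a higher count than the one reached through a single, ambiguous pivot, which is the intuition behind preferring translations supported by several paths.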