This paper presents Edie: ELEXIS DIctionary Evaluator. Edie is designed to create profiles for lexicographic resources accessible through the ELEXIS platform. These profiles can be used to evaluate and compare lexicographic resources and, in particular, to identify data that could potentially be linked.
We describe our current work on linking a new ontology for representing the constitutive elements of Sign Languages with lexical data encoded within the OntoLex-Lemon framework. We first briefly present the current state of the ontology and show how transcriptions of signs can be represented in OntoLex-Lemon in a minimalist manner, before addressing the challenges of linking the elements of the ontology to full lexical descriptions of the spoken languages.
Following presentations of frequency and attestations, and of embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous data sets, with the goal of eliciting feedback from reviewers, the workshop audience and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and to corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how collocation information can be retrieved and aggregated by means of SPARQL and exported to a tabular format, so that it can be easily processed in downstream applications.
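To make the SPARQL retrieval and tabular export concrete, here is a minimal, hypothetical sketch using rdflib. The namespace URI and the frac:head, frac:collocate and frac:score property names are placeholders for illustration, not necessarily the final FrAC vocabulary.

```python
# Illustrative sketch only: query a small RDF graph of collocation records with
# SPARQL via rdflib and dump the result to CSV. The frac: property names and
# namespace below are placeholders, not the official FrAC vocabulary.
import csv
import io

from rdflib import Graph

TTL = """
@prefix frac: <http://www.w3.org/ns/lemon/frac#> .
@prefix ex:   <http://example.org/> .

ex:coll1 a frac:Collocation ;
    frac:head ex:bank ;
    frac:collocate ex:account ;
    frac:score 7.3 .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

QUERY = """
PREFIX frac: <http://www.w3.org/ns/lemon/frac#>
SELECT ?head ?collocate ?score WHERE {
    ?c a frac:Collocation ;
       frac:head ?head ;
       frac:collocate ?collocate ;
       frac:score ?score .
}
ORDER BY DESC(?score)
"""

# Aggregate the query results into a simple tabular (CSV) form.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["head", "collocate", "score"])
for head, collocate, score in g.query(QUERY):
    writer.writerow([head, collocate, score])

print(buf.getvalue())
```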
The objective of the Translation Inference Across Dictionaries (TIAD) series of shared tasks is to explore and compare methods and techniques that infer translations indirectly between language pairs, based on other bilingual/multilingual lexicographic resources. In this fifth edition, the participating systems were asked to automatically generate new translations among three languages - English, French, Portuguese - based on known indirect translations contained in the Apertium RDF graph. These evaluation pairs have remained the same over the last four TIAD editions. Since the fourth edition, however, a larger graph has been used as the basis for producing the translations, namely Apertium RDF v2. The evaluation of the results was carried out by the organisers against manually compiled language pairs of K Dictionaries. For the second time in the TIAD series, some systems beat the proposed baselines. This paper gives an overall description of the shared task, the evaluation data and methodology, and the systems’ results.
To produce new bilingual dictionaries from existing ones, an important task in the field of translation, we propose a system based on a very classical supervised learning technique that uses no knowledge other than the available bilingual dictionaries. It performed very well in the Translation Inference Across Dictionaries (TIAD) shared task on the combined 2021 and 2022 editions. An analysis of its pros and cons suggests a series of avenues for further improving its effectiveness.
Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform a rigorous manual evaluation of four different methods: word alignments on different types of bilingual data, pivoting, machine translation and cross-lingual word embeddings. We investigate how the different setups perform using publicly available data for the English-Icelandic language pair, carrying out separate evaluations for each method, each dataset and, where it can be calculated, each confidence class. The results are validated by human experts working with a random sample from all our experiments. By combining the most promising approaches and data sets, using confidence scores calculated from the data and the results of our manual evaluation of samples as indicators, we are able to induce lists of translations with a very high acceptance rate. We show how several different combinations generate lists with acceptance rates well over 90%, substantially exceeding the results for each individual approach, while still generating reasonably large candidate lists. All manually evaluated equivalence pairs are published in a new lexicon of over 232,000 pairs under an open license.
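As a rough illustration of the combination idea (not the paper's exact procedure), the following sketch merges candidate lists from several induction methods and accepts an English-Icelandic pair if it either reaches an assumed confidence threshold or is proposed by more than one method. The thresholds, voting rule and toy data are assumptions.

```python
# Hypothetical sketch: combine translation candidates from several methods,
# keeping pairs that reach a confidence threshold or are proposed by at least
# two methods. Thresholds and example pairs are illustrative only.
from collections import defaultdict

# candidates[method] = {(source_word, target_word): confidence in [0, 1]}
candidates = {
    "word_alignment": {("house", "hús"): 0.95, ("river", "á"): 0.40},
    "pivoting": {("house", "hús"): 0.80, ("dog", "hundur"): 0.85},
    "embeddings": {("river", "á"): 0.55, ("dog", "hundur"): 0.60},
}

def combine(candidates, min_conf=0.9, min_votes=2):
    """Accept a pair if any method is confident enough or enough methods agree."""
    votes = defaultdict(list)
    for scores in candidates.values():
        for pair, conf in scores.items():
            votes[pair].append(conf)
    accepted = {}
    for pair, confs in votes.items():
        if max(confs) >= min_conf or len(confs) >= min_votes:
            accepted[pair] = max(confs)
    return accepted

print(combine(candidates))
# {('house', 'hús'): 0.95, ('river', 'á'): 0.55, ('dog', 'hundur'): 0.85}
```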
Sense repositories are a key component of many NLP applications that require the identification of word senses. Many sense repositories exist: a large proportion is based on lexicographic resources such as WordNet and various dictionaries, but there are others which are the product of clustering algorithms and other automatic techniques. Over the years these repositories have been mapped to each other. However, there have been no attempts (until now) to provide any theoretical grounding for such mappings, which has led to inconsistencies and unintuitive results. The present paper draws on category theory to formalise assumptions about mapped repositories that are often left implicit, providing formal grounding for this type of language resource. The paper first gives an overview of the word sense disambiguation literature and four types of sense representations: dictionary definitions, clusters of senses, domain labels, and embedding vectors. These different sense representations make different assumptions about the relations and mappings between word senses. We then introduce notation to represent the mappings and repositories as a category, which we call a “sense system”. We represent a sense system as a small category S, where the object set of S, denoted by Ob(S), is a set of sense repositories; and the homomorphism set or hom-set of S, denoted by Hom(S), is a set of mappings between these repositories. On the basis of the sense system description, we propose, formalise, and motivate four basic and two guiding criteria for such sense systems. The four basic criteria are: 1) Correctness preservation: Mappings should preserve the correctness of sense labels in all contexts. Intuitively, if the correct sense for a word token is mapped to another sense, this sense should also be correct for that token. This criterion is endorsed by virtually all existing mappings, but the formalism presented in the paper makes this assumption explicit and allows us to distinguish it from other criteria. 2) Candidacy preservation: Mappings should preserve what we call “the lexical candidacy” of sense labels. Assume that a sense s is mapped to another sense s’ in a different repository. Candidacy preservation then requires that if s is a sense associated with word type w, then so is s’. This criterion is trivially fulfilled by clustering-based approaches, but is not typically explicitly stated for repositories, and we demonstrate how a violation might occur. Our formalisation allows us to specify how this criterion differs from correctness preservation. As we argue, candidacy preservation allows us to straightforwardly and consistently compare granularity levels by counting the number of senses for each word type. 3) Uniqueness criterion: There should be at most one mapping from one repository to another. This criterion is also fulfilled by clustering-based approaches, but is often violated by repositories that use domain labels. We argue that adhering to the uniqueness criterion provides several benefits, including: a) being able to consistently convert between sets of labels and evaluation metrics, allowing researchers to work with data and models that use different sets of labels; b) ensuring that sense repositories would form a partial preorder, which would roughly correspond to the notion of granularity; and c) ensuring transitivity of mapped senses. 4) Connectivity: A sense system should be a connected category.
The connectivity criterion on its own is not very informative, but it enables other criteria by extending their benefits to the rest of the sense system, such as allowing cross-checking between multiple repositories, comparison of granularity levels, and label conversion. As we argue, connectivity should be considered a formal requirement helping to describe sense repositories and how they relate. We also offer two guiding criteria, which we consider aspirational rather than requirements that have to be strictly fulfilled for all purposes: 1) Non-contradiction: Mappings cannot exist between senses that semantically contradict each other. The non-contradiction criterion forbids mappings between senses whose (strict) implications contradict each other. We demonstrate how such a contradiction might occur, but acknowledge the difficulty in identifying such contradictions. As we argue, the reason to consider this a guiding rather than a strict criterion is that many sense repositories lack the semantic specificity that would allow researchers to identify these contradictions. 2) Inter-annotator agreement: Mappings should correspond to a partial preorder of inter-annotator agreement levels. It has been observed that, when annotating corpora with senses from a given sense repository, inter-annotator agreement tends to drop when the repository is more fine-grained. Therefore, if one repository is coarser-grained than another, one can expect agreement levels to be higher when annotating corpora with senses from the first repository. While this criterion will necessarily be subject to empirical variability (and does not apply to sense repositories using non-interpretable representations such as embeddings), we argue that strong violations suggest that the sense distinctions of the coarse-grained sense repository are unnatural, i.e. not in accordance with human linguistic intuitions. Our list is by no means exhaustive, as there are other properties that may be desirable depending on the downstream application. Our category-theory-based formalism will serve as the basis for describing any such further properties. However, we also envision that the criteria we have proposed will serve as guidelines for future sense repositories and mappings, in order to avoid the inconsistencies and counterintuitive results derived from existing mappings.
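As a compact illustration of the formal setup described above, the basic criteria can be written roughly as follows, using the abstract's Ob(S)/Hom(S) notation; the predicate correct(s, t) and the notation R(w) for the senses a repository R associates with word type w are introduced here for illustration and are not taken verbatim from the paper.

```latex
% Illustrative sketch only; correct(s,t) ("sense s is correct for token t of
% word type w") and R(w) are assumed notation, not the paper's exact formalism.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
A sense system is a small category $\mathcal{S}$ with repositories as objects,
$\mathrm{Ob}(\mathcal{S}) = \{R_1, \dots, R_n\}$, and mappings as morphisms,
$\mathrm{Hom}(\mathcal{S})$. For a mapping $f \colon R \to R'$:
\begin{align*}
  \text{Correctness preservation:} \quad
    & \forall w\, \forall t\, \forall s \in R(w):\;
      \mathrm{correct}(s, t) \Rightarrow \mathrm{correct}(f(s), t) \\
  \text{Candidacy preservation:} \quad
    & \forall w\, \forall s:\; s \in R(w) \Rightarrow f(s) \in R'(w) \\
  \text{Uniqueness:} \quad
    & |\mathrm{Hom}_{\mathcal{S}}(R, R')| \le 1
      \quad \text{for all } R, R' \in \mathrm{Ob}(\mathcal{S}) \\
  \text{Connectivity:} \quad
    & \mathcal{S} \text{ is connected as a category.}
\end{align*}
\end{document}
```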
This work combines two lexical resources with morphological information on German word formation, CELEX for German and the latest release of GermaNet, for extracting and building complex word structures. This yields a database of over 100,000 German word trees. A definition for sequential morphological analyses leads to an OntoLex-Lemon-type model. By using GermaNet sense information, the data can be linked to other semantic resources. An alignment to the CIDOC Conceptual Reference Model (CIDOC-CRM) is also provided. The scripts for the data generation are publicly available on GitHub.
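For illustration, a German compound such as "Haustür" ("front door", from "Haus" + "Tür") could be encoded with the OntoLex decomposition module roughly as in the following rdflib sketch; the URIs are invented and the structure is a simplification of the word trees described above, not the paper's actual data model.

```python
# Minimal sketch: one German compound encoded with the OntoLex decomp module.
# Example URIs and the flat two-constituent structure are assumptions.
from rdflib import Graph, Namespace, RDF, URIRef

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
DECOMP = Namespace("http://www.w3.org/ns/lemon/decomp#")
EX = Namespace("http://example.org/de/")

g = Graph()
g.bind("ontolex", ONTOLEX)
g.bind("decomp", DECOMP)

haustuer, haus, tuer = EX.Haustuer, EX.Haus, EX.Tuer
for entry in (haustuer, haus, tuer):
    g.add((entry, RDF.type, ONTOLEX.LexicalEntry))

# Immediate constituents of the compound, each linked to its own lexical entry.
for i, part in enumerate((haus, tuer), start=1):
    component = URIRef(f"{haustuer}#comp{i}")
    g.add((component, RDF.type, DECOMP.Component))
    g.add((haustuer, DECOMP.constituent, component))
    g.add((component, DECOMP.correspondsTo, part))
    g.add((haustuer, DECOMP.subterm, part))

print(g.serialize(format="turtle"))
```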
Macedonian adjectives are inflected for gender, number, definiteness and degree, with on average 47.98 inflections per headword. The inflection paradigm of qualificative adjectives is even richer, comprising 56.27 morphophonemic alternations. Depending on the word they were derived from, more than 600 Macedonian adjectives have an identical headword and two different word forms for each grammatical category. While non-verbal adjectives alter the root before adding the inflectional suffixes, the suffixes of verbal adjectives are added directly to the root. In parallel with the morphological differences, the two types of adjectives also have different translations, depending on the category of the words they have been derived from. Nouns that collocate with these adjectives are mutually disjunctive, enabling the resolution of inflectional ambiguity. They are organised as a lexical taxonomy, created using hierarchical divisive clustering. If embedded in future spell-checking applications, this taxonomy will significantly reduce the risk of forming incorrect inflections, which frequently occur in daily news and even more often in advertisements and social media.
MorphoLex is a study in which the roots, prefixes and suffixes of words are analyzed. With MorphoLex, many words can be analyzed according to certain rules and a useful database can be created. Because Turkish is an agglutinative language with a rich morphological structure, it yields analyses and results that differ from those of previous MorphoLex studies. In this study, we present the process of creating a database of 48,472 words and the results stemming from these differences in language structure.
Wordnets have been popular tools for providing and representing the semantic and lexical relations of languages, and are useful for various purposes in NLP studies. Many researchers have created WordNets for different languages. For Turkish, there are two WordNets, namely the Turkish WordNet of BalkaNet and KeNet. In this paper, we present new WordNets for Turkish, each of which is based on one of the first 9 editions of the Turkish dictionary, starting from the 1944 edition. These WordNets are historical in nature and have implications for Modern Turkish. They are developed by extending KeNet, which was created based on the 2005 and 2011 editions of the Turkish dictionary. We explain the steps in creating these 9 new WordNets for Turkish, discuss the challenges in the process and report comparative results about the WordNets.
This paper presents a connection between WordNet and Wikipedia by linking synsets from the Turkish WordNet KeNet with Wikipedia, thus providing a better machine-readable dictionary for creating NLP models with rich data. For this purpose, a manual mapping between the two resources is carried out and 11,478 synsets are linked to Wikipedia. In addition, automatic linking approaches are employed to analyze possible connection suggestions. A baseline approach and an ElasticSearch-based approach help identify potential human annotation errors and allow us to analyze the effectiveness of these approaches in linking. Adopting both manual and automatic mapping provides us with a comprehensive resource of WordNet-Wikipedia connections.
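One way to picture an ElasticSearch-based suggestion step is the hypothetical sketch below: index Wikipedia article titles and summaries, then query with a synset's lemma and gloss to obtain ranked candidate pages. The index name, fields, toy data and the use of the 8.x Python client are assumptions, and a running Elasticsearch instance is required; this is not the paper's actual pipeline.

```python
# Hypothetical sketch of ElasticSearch-based link suggestion; all names and
# data are illustrative. Requires a local Elasticsearch instance (8.x client).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

articles = [
    {"title": "Kedi", "summary": "Kedigiller familyasından evcil bir memeli."},
    {"title": "Kaplan", "summary": "Kedigiller familyasından büyük bir yırtıcı."},
]
for i, article in enumerate(articles):
    es.index(index="wikipedia_tr", id=i, document=article)
es.indices.refresh(index="wikipedia_tr")

def suggest_links(lemma, gloss, k=3):
    """Return the top-k Wikipedia candidates for a synset lemma and gloss."""
    response = es.search(
        index="wikipedia_tr",
        query={
            "multi_match": {
                "query": f"{lemma} {gloss}",
                "fields": ["title^2", "summary"],  # boost title matches
            }
        },
        size=k,
    )
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]

print(suggest_links("kedi", "evcil bir hayvan"))
```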
A widely acknowledged shortcoming of WordNet is that it lacks a distinction between word meanings which are systematically related (polysemy), and those which are coincidental (homonymy). Several previous works have attempted to fill this gap, by inferring this information using computational methods. We revisit this task, and exploit recent advances in language modelling to synthesise homonymy annotation for Princeton WordNet. Previous approaches treat the problem using clustering methods; by contrast, our method works by linking WordNet to the Oxford English Dictionary, which contains the information we need. To perform this alignment, we pair definitions based on their proximity in an embedding space produced by a Transformer model. Despite the simplicity of this approach, our best model attains an F1 of .97 on an evaluation set that we annotate. The outcome of our work is a high-quality homonymy annotation layer for Princeton WordNet, which we release.
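The definition-pairing step can be pictured with a small sentence-transformers sketch: embed glosses from two resources with a Transformer sentence encoder and link each WordNet gloss to its nearest dictionary gloss by cosine similarity. The model name, the toy glosses and the simple nearest-neighbour rule are illustrative assumptions, not the setup reported in the paper.

```python
# Rough sketch of embedding-based definition pairing; model choice and glosses
# are illustrative only.
from sentence_transformers import SentenceTransformer, util

wordnet_glosses = [
    "a financial institution that accepts deposits",
    "sloping land beside a body of water",
]
dictionary_glosses = [
    "an organization where people keep and borrow money",
    "the land alongside a river or lake",
    "a long seat for several people",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
wn_emb = model.encode(wordnet_glosses, convert_to_tensor=True)
dict_emb = model.encode(dictionary_glosses, convert_to_tensor=True)

# Pairwise cosine similarities, shape (len(wordnet_glosses), len(dictionary_glosses)).
similarities = util.cos_sim(wn_emb, dict_emb)
for i, gloss in enumerate(wordnet_glosses):
    j = int(similarities[i].argmax())
    print(f"{gloss!r} -> {dictionary_glosses[j]!r} ({float(similarities[i][j]):.2f})")
```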