Sussi Olsen

2022

Compiling a Suitable Level of Sense Granularity in a Lexicon for AI Purposes: The Open Source COR Lexicon
Bolette Pedersen | Nathalie Carmen Hau Sørensen | Sanni Nimb | Ida Flørke | Sussi Olsen | Thomas Troelsgård
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present The Central Word Register for Danish (COR), an open source lexicon project for general AI purposes, funded and initiated by the Danish Agency for Digitisation as part of an AI initiative launched by the Danish Government in 2020. We focus here on the lexical semantic part of the project (COR-S) and describe how, based on the existing fine-grained sense inventory from Den Danske Ordbog (DDO), we compile a sense granularity level of the vocabulary that is better suited to AI. A three-step methodology is applied: first, we establish a set of linguistic principles for defining core senses in COR-S; from there, we generate a hand-crafted gold standard of 6,000 lemmas showing how to get from the fine-grained DDO senses to the COR inventory. Finally, we experiment with a number of language models in order to automate the sense reduction of the rest of the lexicon. The models comprise a rule-based model that applies our linguistic principles in terms of features, a word2vec model using cosine similarity to measure sense proximity, and a deep neural BERT model fine-tuned on our annotations. The rule-based approach shows the best results, in particular on adjectives; however, when focusing on the average polysemous vocabulary, the BERT model shows promising results too.
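As an illustration of the cosine-similarity idea mentioned in the abstract above, the following is a minimal sketch and not the authors' implementation. The toy three-dimensional vectors, the sense identifiers, the greedy grouping strategy, and the 0.8 threshold are all assumptions made for the example; a real system would use word2vec-derived sense representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_senses(sense_vectors, threshold=0.8):
    """Greedily group senses: a sense joins the first cluster whose
    representative (its first member) is at least `threshold` similar;
    otherwise it starts a new cluster."""
    clusters = []
    for sense_id, vec in sense_vectors.items():
        for cluster in clusters:
            if cosine(vec, sense_vectors[cluster[0]]) >= threshold:
                cluster.append(sense_id)
                break
        else:
            clusters.append([sense_id])
    return clusters


# Hypothetical sense vectors for one lemma (toy values, not real data).
senses = {
    "sense_1": [1.0, 0.0, 0.1],
    "sense_1a": [0.9, 0.1, 0.1],
    "sense_2": [0.0, 1.0, 0.0],
}
clusters = cluster_senses(senses)
print(clusters)
```

In this toy setup the two near-parallel vectors end up in one coarse sense while the orthogonal one stays separate; the threshold controls how aggressively fine-grained senses are merged.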

A Thesaurus-based Sentiment Lexicon for Danish: The Danish Sentiment Lexicon
Sanni Nimb | Sussi Olsen | Bolette Pedersen | Thomas Troelsgård
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes how a newly published Danish sentiment lexicon with high lexical coverage was compiled using lexicographic methods, based on the links between groups of words listed in semantic order in a thesaurus and the corresponding word sense descriptions in a comprehensive monolingual dictionary. The overall idea was to identify negative and positive sections in a thesaurus, extract the words from these sections and combine them with the dictionary information via the links. The annotation of the dataset included several steps and was based on the comparison of synonyms and near synonyms within a semantic field. In cases where one of the words was included in the smaller Danish sentiment lexicon AFINN, its value there was used as inspiration and extended to the synonyms when appropriate. In order to obtain a more practical lexicon with overall polarity values at lemma level, all the senses of each lemma were afterwards compared, taking into consideration dictionary information such as usage, style and frequency. The final lexicon contains 13,859 Danish polarity lemmas and includes morphological information. It is freely available at https://github.com/dsldk/danish-sentiment-lexicon (licence CC-BY-SA 4.0 International).

2021

DanNet2: Extending the coverage of adjectives in DanNet based on thesaurus data (project presentation)
Sanni Nimb | Bolette Pedersen | Sussi Olsen
Proceedings of the 11th Global Wordnet Conference

The paper describes work in progress in the DanNet2 project financed by the Carlsberg Foundation. The project's aim is to extend the original Danish wordnet, DanNet, in several ways. The main focus is on extending the coverage and description of the adjectives, a part of speech that was rather sparsely described in the original wordnet. We describe the methodology and initial work of semi-automatically transferring adjectives from the Danish Thesaurus to the wordnet with the aim of easily enlarging the coverage from 3,000 to approx. 13,000 adjectival synsets. Transfer is performed by manually encoding all missing adjectival subsection headwords from the thesaurus and thereafter employing a semi-automatic procedure where adjectives from the same subsection are transferred to the wordnet as either 1) near synonyms to the section’s headword, 2) hyponyms to the section’s headword, or 3) members of the same synset as the headword. We also discuss how to deal with the problem of multiple representations of the same sense in the thesaurus, and present other types of information from the thesaurus that we plan to integrate, such as thematic and sentiment information.

2020

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by enabling new solutions, particularly notoriously data-hungry ones such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

2019

Merging DanNet with Princeton Wordnet
Bolette Sandford Pedersen | Sanni Nimb | Ida Rørmann Olsen | Sussi Olsen
Proceedings of the 10th Global Wordnet Conference

In this paper we describe the merge of the Danish wordnet, DanNet, with Princeton WordNet applying a two-step approach. We first link from the English Princeton core to Danish (5,000 base concepts) and then proceed to linking the rest of the Danish vocabulary to English, thus going from Danish to English. Since the Danish wordnet is built bottom-up from Danish lexica and corpora, all taxonomies are monolingually based and thus not necessarily directly compatible with the coverage and structure of the Princeton WordNet. This poses some challenges to the linking procedure, since a considerable number of the links cannot be realised via the preferred cross-language synonym link, which implies a more or less precise correlation between the two concepts. Instead, a subset of the links is realised through near synonym or hyponymy links to compensate for the fact that no precise translation can be found in the target resource. The tool WordnetLoom is currently used for manual linking, but options for a more automatic procedure in the future are discussed. We conclude that the two resources differ from each other considerably more than expected, in both vocabulary and structure.

2018

Towards a principled approach to sense clustering – a case study of wordnet and dictionary senses in Danish
Bolette Pedersen | Manex Agirrezabal | Sanni Nimb | Ida Olsen | Sussi Olsen
Proceedings of the 9th Global Wordnet Conference

Our aim is to develop principled methods for sense clustering which can make existing lexical resources practically useful in NLP: not too fine-grained to be operational and yet fine-grained enough to be worth the trouble. Where traditional dictionaries have a highly structured sense inventory, typically describing the vocabulary by means of main and subsenses, wordnets are generally fine-grained and unstructured. We present a series of clustering and annotation experiments with 10 of the most polysemous nouns in Danish. We combine the structured information of a traditional Danish dictionary with the ontological types found in the Danish wordnet, DanNet. This combination enables us to automatically cluster senses in a principled way and improve inter-annotator agreement and WSD performance.

A Danish FrameNet Lexicon and an Annotated Corpus Used for Training and Evaluating a Semantic Frame Classifier
Bolette Pedersen | Sanni Nimb | Anders Søgaard | Mareike Hartmann | Sussi Olsen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Language resources (LRs) are indispensable for the development of tools for machine translation (MT) or various kinds of computer-assisted translation (CAT). In particular, language corpora, both parallel and monolingual, are considered most important, for instance for MT, not only SMT but also hybrid MT. The Language Technology Observatory will provide easy access to information about LRs deemed to be useful for MT and other translation tools through its LR Catalogue. In order to determine what aspects of an LR are useful for MT practitioners, a user study was carried out, providing a guide to the most relevant metadata and the most relevant quality criteria. We have seen that many resources exist which are useful for MT and similar work, but the majority are for (academic) research or educational use only, and as such not available for commercial use. Our work has revealed a list of gaps: coverage gap, awareness gap, quality gap, quantity gap. The paper ends with recommendations for a forward-looking strategy.

We launch the SemDaX corpus, a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish, measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is normally seen: for the all-words task we double annotated 60% of the material and for the lexical sample task 100%. We include in the corpus not only the adjudicated files, but also the diverging annotations. In other words, we do not consider all disagreement to be noise, but rather to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.

An empirically grounded expansion of the supersense inventory
Hector Martinez Alonso | Anders Johannsen | Sanni Nimb | Sussi Olsen | Bolette Pedersen
Proceedings of the 8th Global WordNet Conference (GWC)

In this article we present an expansion of the supersense inventory. All new supersenses are extensions of members of the current inventory, which we postulate by identifying semantically coherent groups of synsets. We cover the expansion of the already-established supersense inventory for nouns and verbs, the addition of coarse supersenses for adjectives in the absence of a canonical supersense inventory, and supersenses for verbal satellites. We evaluate the viability of the new senses by examining annotation agreement, frequency and co-occurrence patterns.

2015

Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015
Bolette Sandford Pedersen | Sussi Olsen | Lars Borin
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

Coarse-grained sense annotation of Danish across textual domains
Sussi Olsen | Bolette S. Pedersen | Héctor Martínez Alonso | Anders Johannsen
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

2014

Using TEI, CMDI and ISOcat in CLARIN-DK
Dorte Haltrup Hansen | Lene Offersgaard | Sussi Olsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the challenges and issues encountered in the conversion of TEI header metadata into the CMDI format. The work is carried out in the Danish research infrastructure, CLARIN-DK, in order to enable the exchange of language resources nationally as well as internationally, in particular with other partners of CLARIN ERIC. The paper describes the task of converting an existing TEI specification applied to all the text resources deposited in DK-CLARIN. During the task we have tried to reuse and share CMDI profiles and components in the CLARIN Component Registry, as well as to link the CMDI components and elements to the relevant data categories in the ISOcat Data Category Registry. The conversion of the existing metadata into the CMDI format turned out not to be a trivial task, and the experience and insights gained from this work have resulted in a proposal for a workflow for future use. We also present a core TEI header metadata set.

2013

2012

A Distributed Resource Repository for Cloud-Based Machine Translation
Jörg Tiedemann | Dorte Haltrup Hansen | Lene Offersgaard | Sussi Olsen | Matthias Zumpe
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present the architecture of a distributed resource repository developed for collecting training data for building customized statistical machine translation systems. The repository is designed for the cloud-based translation service integrated in the Let'sMT! platform which is about to be launched to the public. The system includes important features such as automatic import and alignment of textual documents in a variety of formats, a flexible database for meta-information using modern key-value stores and a grid-based backend for running off-line processes. The entire system is very modular and supports highly distributed setups to enable a maximum of flexibility and scalability. The system uses secure connections and includes an effective permission management to ensure data integrity. In this paper, we also take a closer look at the task of sentence alignment. The process of alignment is extremely important for the success of translation models trained on the platform. Alignment decisions significantly influence the quality of SMT engines.

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, and main actors in academia, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; and documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The paper gives an overview of the three horizontal multilingual actions in META-NORD: linking and validating Nordic and Baltic wordnets, harmonising multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

2010

Quality Indicators of LSP Texts — Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Jakob Halskov | Dorte Haltrup Hansen | Anna Braasch | Sussi Olsen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and poor LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of best practice for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.
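To make the reported precision and recall figures concrete, here is a minimal sketch of threshold-based filtering and its evaluation. It is not the system described in the abstract above: the indicator names (`mean_word_length`, `oov_rate`), the threshold values, the rule that a text is flagged when any indicator falls below its threshold, and the toy documents are all assumptions made for illustration.

```python
def rejected(docs, thresholds):
    """Return the ids of documents where any indicator falls below its
    (hypothetical) minimum threshold."""
    return {
        doc_id
        for doc_id, feats in docs.items()
        if any(feats[name] < minimum for name, minimum in thresholds.items())
    }


def precision_recall(predicted, gold):
    """Precision and recall of the 'filter out' decision, given sets of
    document ids."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy candidate texts with made-up indicator values.
docs = {
    "a": {"mean_word_length": 6.1, "oov_rate": 0.12},
    "b": {"mean_word_length": 4.2, "oov_rate": 0.02},
    "c": {"mean_word_length": 5.8, "oov_rate": 0.03},
    "d": {"mean_word_length": 4.0, "oov_rate": 0.10},
}
thresholds = {"mean_word_length": 4.5, "oov_rate": 0.01}

# Texts a human judged unsuitable for the corpus (toy gold standard).
gold_rejects = {"b", "c", "d"}

flagged = rejected(docs, thresholds)
precision, recall = precision_recall(flagged, gold_rejects)
print(sorted(flagged), precision, round(recall, 2))
```

The toy data shows the same kind of trade-off the abstract reports: every flagged text is a genuine reject, but some unsuitable texts slip through because no single indicator catches them.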

2008

Annotating Abstract Pronominal Anaphora in the DAD Project
Costanza Navarretta | Sussi Olsen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present an extension of the MATE/GNOME annotation scheme for anaphora (Poesio, 2004) which accounts for abstract anaphora in Danish and Italian. By abstract anaphora we here mean pronouns whose linguistic antecedents are verbal phrases, clauses and discourse segments. The extended scheme, which we call the DAD annotation scheme, allows the annotation of information about abstract anaphora which is important for investigating their use, see i.a. (Webber, 1988; Gundel et al., 2003; Navarretta, 2004; Navarretta, 2007), and which can influence their automatic treatment. Intercoder agreement scores obtained by applying the DAD annotation scheme to texts and dialogues in the two languages are given and show that the information proposed in the scheme can be recognised in a reliable way.

Merging a Syntactic Resource with a WordNet: a Feasibility Study of a Merge between STO and DanNet
Bolette Sandford Pedersen | Anna Braasch | Lina Henriksen | Sussi Olsen | Claus Povlsen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents a feasibility study of a merge between SprogTeknologisk Ordbase (STO), which contains morphological and syntactic information, and DanNet, which is a Danish WordNet containing semantic information in terms of synonym sets and semantic relations. The aim of the merge is to develop a richer, composite resource which we believe will have a broader usage perspective than the two seen in isolation. In STO, the organizing principle is based on the observable syntactic features of a lemma's near context (labeled syntactic units or SynUs). In contrast, the basic unit in DanNet is constituted by semantic senses or, in wordnet terminology, synonym sets (synsets). The merge of the two resources is thus basically to be understood as a linking between SynUs and synsets. In the paper we discuss which parts of the merge can be performed semi-automatically and which parts require manual linguistic matching procedures. We estimate that this manual work will amount to approx. 39% of the lexicon material.

2004

STO: A Danish Lexicon Resource - Ready for Applications
Anna Braasch | Sussi Olsen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper deals with the STO lexicon, the most comprehensive computational lexicon of Danish developed for NLP/HLT applications, which is now ready for use. Danish was one of the 12 EU languages participating in the LE-PAROLE and SIMPLE projects; it was therefore natural to continue this work, building on our experience from these projects. The material for Danish produced within these projects, further enriched with language-specific information, is incorporated into the STO lexicon. First, we describe the main characteristics of the lexical coverage and linguistic content of the STO lexicon; second, we present some recent uses and point to some prospective exploitations of the material. Finally, we outline an internet-based user interface, which allows for browsing through the complex information content of the STO lexical database and some other selected WRLs for Danish.