Anna Braasch

2016

We launch the SemDaX corpus which is a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the developed corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these aims, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than what is normally seen: for the all-words task we double annotated 60% of the material and for the lexical sample task 100%. We include in the corpus not only the adjucated files, but also the diverging annotations. In other words, we consider not all disagreement to be noise, but rather to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.

2015

2012

ESICT (Experience-oriented Sharing of health knowledge via Information and Communication Technology) is an ongoing research project funded by the Danish Council for Strategic Research. It aims at developing a health/disease related information system based on information technology, language technology, and formalized medical knowledge. The formalized medical knowledge consists partly of the terminology database SNOMED CT and partly of authorized medical texts on the domain. The system will allow users to ask questions in Danish and will provide natural language answers. Currently, the project is pursuing three basically different methods for question answering, and they are all described to some extent in this paper. A system prototype will handle questions related to diabetes and heart diseases. This paper concentrates on the methods employed for question answering and the language resources that are utilized. Some resources were existing, such as SNOMED CT, others, such as a corpus of sample questions, have had to be created or constructed.

2010

pdf abs
Merging Specialist Taxonomies and Folk Taxonomies in Wordnets - A case Study of Plants, Animals and Foods in the Danish Wordnet
Bolette S. Pedersen | Sanni Nimb | Anna Braasch
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we investigate the problem of merging specialist taxonomies with the more intuitive folk taxonomies in lexical-semantic resources like wordnets; and we focus in particular on plants, animals and foods. We show that a traditional dictionary like Den Danske Ordbog (DDO) survives well with several inconsistencies between different taxonomies of the vocabulary and that a restructuring is therefore necessary in order to compile a consistent wordnet resource on its basis. To this end, we apply Cruses definitions for hyponymies, namely those of natural kinds (such as plants and animals) on the one hand and functional kinds (such as foods) on the other. We pursue this distinction in the development of the Danish wordnet, DanNet, which has recently been built on the basis of DDO and is made open source for all potential users at www.wordnet.dk. Not surprisingly, we conclude that cultural background influences the structure of folk taxonomies quite radically, and that wordnet builders must therefore consider these carefully in order to capture their central characteristics in a systematic way.

pdf abs
Quality Indicators of LSP Texts — Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Jakob Halskov | Dorte Haltrup Hansen | Anna Braasch | Sussi Olsen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and poor LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of best practice for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.

2009

pdf
What do we need to know about humans? A view into the DanNet database
Bolette Sandford Pedersen | Anna Braasch
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

pdf abs
Merging a Syntactic Resource with a WordNet: a Feasibility Study of a Merge between STO and DanNet
Bolette Sandford Pedersen | Anna Braasch | Lina Henriksen | Sussi Olsen | Claus Povlsen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents a feasibility study of a merge between SprogTeknologisk Ordbase (STO), which contains morphological and syntactic information, and DanNet, which is a Danish WordNet containing semantic information in terms of synonym sets and semantic relations. The aim of the merge is to develop a richer, composite resource which we believe will have a broader usage perspective than the two seen in isolation. In STO, the organizing principle is based on the observable syntactic features of a lemmas near context (labeled syntactic units or SynUs). In contrast, the basic unit in DanNet is constituted by semantic senses or - in wordnet terminology - synonym sets (synsets). The merge of the two resources is thus basically to be understood as a linking between SynUs and synsets. In the paper we discuss which parts of the merge can be performed semi-automatically and which parts require manual linguistic matching procedures. We estimate that this manual work will amount to approx. 39% of the lexicon material.

2004

pdf abs
STO: A Danish Lexicon Resource - Ready for Applications
Anna Braasch | Sussi Olsen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper deals with the STO lexicon, the most comprehensive computational lexicon of Danish developed for NLP/HLT applications, which is now ready for use. Danish was one of the 12 EU-languages participating in the LE-PAROLE and SIMPLE projects; therefore it was obvious to continue this work building on our experience obtained from these projects. The material for Danish produced within these projects – further enriched with language-specific information - is incorporated into the STO lexicon. First, we describe the main characteristics of the lexical coverage and linguistic content of the STO lexicon; second, we present some recent uses and point to some prospective exploitations of the material. Finally, we outline an internet-based user interface, which allows for browsing through the complex information content of the STO lexical database and some other selected WRL’s for Danish.

Anna Braasch

2016

2015

2012

2010

2009

2008

2004

2002

2000

1990

Co-authors

Venues