2012
pdf
abs
Iterative Refinement and Quality Checking of Annotation Guidelines — How to Deal Effectively with Semantically Sloppy Named Entity Types, such as Pathological Phenomena
Udo Hahn
|
Elena Beisswanger
|
Ekaterina Buyko
|
Erik Faessler
|
Jenny Traumüller
|
Susann Schröder
|
Kerstin Hornbostel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We here discuss a methodology for dealing with the annotation of semantically hard to delineate, i.e., sloppy, named entity types. To illustrate sloppiness of entities, we treat an example from the medical domain, namely pathological phenomena. Based on our experience with iterative guideline refinement we propose to carefully characterize the thematic scope of the annotation by positive and negative coding lists and allow for alternative, short vs. long mention span annotations. Short spans account for canonical entity mentions (e.g., standardized disease names), while long spans cover descriptive text snippets which contain entity-specific elaborations (e.g., anatomical locations, observational details, etc.). Using this stratified approach, evidence for increasing annotation performance is provided by kappa-based inter-annotator agreement measurements over several, iterative annotation rounds using continuously refined guidelines. The latter reflects the increasing understanding of the sloppy entity class both from the perspective of guideline writers and users (annotators). Given our data, we have gathered evidence that we can deal with sloppiness in a controlled manner and expect inter-annotator agreement values around 80% for PathoJen, the pathological phenomena corpus currently under development in our lab.
2010
pdf
abs
The GeneReg Corpus for Gene Expression Regulation Events — An Overview of the Corpus and its In-Domain and Out-of-Domain Interoperability
Ekaterina Buyko
|
Elena Beisswanger
|
Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Despite the large variety of corpora in the biomedical domain their annotations differ in many respects, e.g., the coverage of different, highly specialized knowledge domains, varying degrees of granularity of targeted relations, the specificity of linguistic anchoring of relations and named entities in documents, etc. We here present GeneReg (Gene Regulation Corpus), the result of an annotation campaign led by the Jena University Language & Information Engineering (JULIE) Lab. The GeneReg corpus consists of 314 abstracts dealing with the regulation of gene expression in the model organism E. coli. Our emphasis in this paper is on the compatibility of the GeneReg corpus with the alternative Genia event corpus and with several in-domain and out-of-domain lexical resources, e.g., the Specialist Lexicon, FrameNet, and WordNet. The links we established from the GeneReg corpus to these external resources will help improve the performance of the automatic relation extraction engine JREx trained and evaluated on GeneReg.
pdf
abs
The CALBC Silver Standard Corpus for Biomedical Named Entities — A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
Dietrich Rebholz-Schuhmann
|
Antonio José Jimeno Yepes
|
Erik M. van Mulligen
|
Ning Kang
|
Jan Kors
|
David Milward
|
Peter Corbett
|
Ekaterina Buyko
|
Katrin Tomanek
|
Elena Beisswanger
|
Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The production of gold standard corpora is time-consuming and costly. We propose an alternative: the âsilver standard corpus (SSC), a corpus that has been generated by the harmonisation of the annotations that have been delivered from a selection of annotation systems. The systems have to share the type system for the annotations and the harmonisation solution has use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630.324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand, how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
pdf
A Proposal for a Configurable Silver Standard
Udo Hahn
|
Katrin Tomanek
|
Elena Beisswanger
|
Erik Faessler
Proceedings of the Fourth Linguistic Annotation Workshop
2008
pdf
abs
Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn
|
Elena Beisswanger
|
Ekaterina Buyko
|
Michael Poprat
|
Katrin Tomanek
|
Joachim Wermter
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large amount of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference relations in particular.
pdf
Building a BioWordNet Using WordNet Data Structures and WordNet’s Software Infrastructure–A Failure Story
Michael Poprat
|
Elena Beisswanger
|
Udo Hahn
Software Engineering, Testing, and Quality Assurance for Natural Language Processing