Ekaterina Buyko

2012

pdf abs
Iterative Refinement and Quality Checking of Annotation Guidelines — How to Deal Effectively with Semantically Sloppy Named Entity Types, such as Pathological Phenomena
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Erik Faessler | Jenny Traumüller | Susann Schröder | Kerstin Hornbostel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We here discuss a methodology for dealing with the annotation of semantically hard to delineate, i.e., sloppy, named entity types. To illustrate sloppiness of entities, we treat an example from the medical domain, namely pathological phenomena. Based on our experience with iterative guideline refinement we propose to carefully characterize the thematic scope of the annotation by positive and negative coding lists and allow for alternative, short vs. long mention span annotations. Short spans account for canonical entity mentions (e.g., standardized disease names), while long spans cover descriptive text snippets which contain entity-specific elaborations (e.g., anatomical locations, observational details, etc.). Using this stratified approach, evidence for increasing annotation performance is provided by kappa-based inter-annotator agreement measurements over several, iterative annotation rounds using continuously refined guidelines. The latter reflects the increasing understanding of the sloppy entity class both from the perspective of guideline writers and users (annotators). Given our data, we have gathered evidence that we can deal with sloppiness in a controlled manner and expect inter-annotator agreement values around 80% for PathoJen, the pathological phenomena corpus currently under development in our lab.

2010

pdf
Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks
Ekaterina Buyko | Udo Hahn
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf abs
The GeneReg Corpus for Gene Expression Regulation Events — An Overview of the Corpus and its In-Domain and Out-of-Domain Interoperability
Ekaterina Buyko | Elena Beisswanger | Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Despite the large variety of corpora in the biomedical domain their annotations differ in many respects, e.g., the coverage of different, highly specialized knowledge domains, varying degrees of granularity of targeted relations, the specificity of linguistic anchoring of relations and named entities in documents, etc. We here present GeneReg (Gene Regulation Corpus), the result of an annotation campaign led by the Jena University Language & Information Engineering (JULIE) Lab. The GeneReg corpus consists of 314 abstracts dealing with the regulation of gene expression in the model organism E. coli. Our emphasis in this paper is on the compatibility of the GeneReg corpus with the alternative Genia event corpus and with several in-domain and out-of-domain lexical resources, e.g., the Specialist Lexicon, FrameNet, and WordNet. The links we established from the GeneReg corpus to these external resources will help improve the performance of the automatic relation extraction engine JREx trained and evaluated on GeneReg.

The production of gold standard corpora is time-consuming and costly. We propose an alternative: the âsilver standard corpus (SSC), a corpus that has been generated by the harmonisation of the annotations that have been delivered from a selection of annotation systems. The systems have to share the type system for the annotations and the harmonisation solution has use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630.324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand, how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.

2009

pdf
How Feasible and Robust is the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and Real-Life Conditions
Udo Hahn | Katrin Tomanek | Ekaterina Buyko | Jung-jae Kim | Dietrich Rebholz-Schuhmann
Proceedings of the BioNLP 2009 Workshop

pdf
Event Extraction from Trimmed Dependency Graphs
Ekaterina Buyko | Erik Faessler | Joachim Wermter | Udo Hahn
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

2008

pdf
Are Morpho-Syntactic Features More Predictive for the Resolution of Noun Phrase Coordination Ambiguity than Lexico-Semantic Similarity Scores?
Ekaterina Buyko | Udo Hahn
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf abs
Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Michael Poprat | Katrin Tomanek | Joachim Wermter
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large amount of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference relations in particular.

pdf abs
Ontology-Based Interface Specifications for a NLP Pipeline Architecture
Ekaterina Buyko | Christian Chiarcos | Antonio Pareja Lora
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The high level of heterogeneity between linguistic annotations usually complicates the interoperability of processing modules within an NLP pipeline. In this paper, a framework for the interoperation of NLP components, based on a data-driven architecture, is presented. Here, ontologies of linguistic annotation are employed to provide a conceptual basis for the tagset-neutral processing of linguistic annotations. The framework proposed here is based on a set of structured OWL ontologies: a reference ontology, a set of annotation models which formalize different annotation schemes, and a declarative linking between these, specified separately. This modular architecture is particularly scalable and flexible as it allows for the integration of different reference ontologies of linguistic annotations in order to overcome the absence of a consensus for an ontology of linguistic terminology. Our proposal originates from three lines of research from different fields: research on annotation type systems in UIMA; the ontological architecture OLiA, originally developed for sustainable documentation and annotation-independent corpus browsing, and the ontologies of the OntoTag model, targeted towards the processing of linguistic annotations in Semantic Web applications. We describe how UIMA annotations can be backed up by ontological specifications of annotation schemes as in the OLiA model, and how these are linked to the OntoTag ontologies, which allow for further ontological processing.

2007