Katrin Tomanek


2021

Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech
Katrin Tomanek | Vicky Zayats | Dirk Padfield | Kara Vaillancourt | Fadi Biadsy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to those of model fine-tuning while updating only a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
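
The adapter idea in this abstract can be illustrated with a minimal sketch. It assumes a generic bottleneck design (down-projection, ReLU, up-projection, residual add); the abstract does not specify the exact layer layout, so the function names, the omission of biases and normalization, and the toy dimensions below are all hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def residual_adapter(x, w_down, w_up):
    """Generic bottleneck adapter: x + W_up * relu(W_down * x).

    Assumption: no bias terms and no layer normalization; the paper's
    actual adapter may differ.
    """
    hidden = [max(0.0, h) for h in matvec(w_down, x)]
    delta = matvec(w_up, hidden)
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(d_model, bottleneck):
    """Trainable weights in one adapter (two projection matrices, no biases)."""
    return 2 * d_model * bottleneck


# Toy sizes, hypothetical: real encoder layers are far wider.
d_model, bottleneck = 8, 2
rng = random.Random(0)
w_down = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(bottleneck)]
# Zero-initialised up-projection: the adapter starts as the identity function,
# so the frozen base model's behaviour is unchanged before adaptation.
w_up = [[0.0] * bottleneck for _ in range(d_model)]

x = [float(i) for i in range(d_model)]
y = residual_adapter(x, w_down, w_up)
```

Only `w_down` and `w_up` would be trained per speaker in such a setup, while the base model stays frozen and shared, which is what keeps the per-speaker footprint small.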

2016

Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)
Annemarie Friedrich | Katrin Tomanek
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

2011

Proceedings of the 5th Linguistic Annotation Workshop
Nancy Ide | Adam Meyers | Sameer Pradhan | Katrin Tomanek
Proceedings of the 5th Linguistic Annotation Workshop

2010

Annotation Time Stamps — Temporal Metadata from the Linguistic Annotation Process
Katrin Tomanek | Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe the re-annotation of selected types of named entities (persons, organizations, locations) from the MUC7 corpus. The focus of this annotation initiative is on recording the time needed for the linguistic process of named entity annotation. Annotation times are measured on two basic annotation units: sentences vs. complex noun phrases. We gathered evidence that decision times are non-uniformly distributed over the annotation units, while they do not substantially deviate among annotators. This data seems to support the hypothesis that annotation times very much depend on the inherent "hardness" of each single annotation decision. We further show how such time-stamped information can be used for empirically grounded studies of selective sampling techniques, such as Active Learning. We directly compare Active Learning costs on the basis of token-based vs. time-based measurements. The data reveals that Active Learning keeps its competitive advantage over random sampling in both scenarios, though the difference is less marked for the time metric than for the token metric.

The CALBC Silver Standard Corpus for Biomedical Named Entities — A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
Dietrich Rebholz-Schuhmann | Antonio José Jimeno Yepes | Erik M. van Mulligen | Ning Kang | Jan Kors | David Milward | Peter Corbett | Ekaterina Buyko | Katrin Tomanek | Elena Beisswanger | Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The production of gold standard corpora is time-consuming and costly. We propose an alternative: the "silver standard corpus" (SSC), a corpus generated by harmonising the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all annotation solutions used, leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.

A Comparison of Models for Cost-Sensitive Active Learning
Katrin Tomanek | Udo Hahn
Coling 2010: Posters

A Cognitive Cost Model of Annotations Based on Eye-Tracking Data
Katrin Tomanek | Udo Hahn | Steffen Lohmann | Jürgen Ziegler
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing
Burr Settles | Kevin Small | Katrin Tomanek
Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing

A Proposal for a Configurable Silver Standard
Udo Hahn | Katrin Tomanek | Elena Beisswanger | Erik Faessler
Proceedings of the Fourth Linguistic Annotation Workshop

2009

Semi-Supervised Active Learning for Sequence Labeling
Katrin Tomanek | Udo Hahn
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

An Intrinsic Stopping Criterion for Committee-Based Active Learning
Fredrik Olsson | Katrin Tomanek
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

How Feasible and Robust is the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and Real-Life Conditions
Udo Hahn | Katrin Tomanek | Ekaterina Buyko | Jung-jae Kim | Dietrich Rebholz-Schuhmann
Proceedings of the BioNLP 2009 Workshop

Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
Eric Ringger | Robbie Haertel | Katrin Tomanek
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

On Proper Unit Selection in Active Learning: Co-Selection Effects for Named Entity Recognition
Katrin Tomanek | Florian Laws | Udo Hahn | Hinrich Schütze
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

A Web Survey on the Use of Active Learning to Support Annotation of Text Data
Katrin Tomanek | Fredrik Olsson
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

Timed Annotations — Enhancing MUC7 Metadata by the Time It Takes to Annotate Named Entities
Katrin Tomanek | Udo Hahn
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Approximating Learning Curves for Active-Learning-Driven Annotation
Katrin Tomanek | Udo Hahn
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Active learning (AL) is becoming increasingly popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks. A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in effort. This question is closely related to estimating the classifier performance after a certain amount of data has already been annotated. While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which, in realistic annotation settings at least, is often unavailable. We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members. This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain. Considering named entity recognition as a test case, we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions.
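
As a rough illustration of tracking committee disagreement on an unlabeled corpus, the sketch below averages vote entropy over examples after each annotation round. Vote entropy is an assumption made here for concreteness; the abstract does not name the disagreement measure, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label votes cast by committee members for one example."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mean_disagreement(committee_predictions):
    """Average vote entropy over a held-out, unlabeled set of examples.

    committee_predictions: one list of member votes per example.
    """
    return sum(vote_entropy(v) for v in committee_predictions) / len(committee_predictions)


# Toy votes from a three-member committee on two unlabeled examples,
# recorded after an early and a later annotation round (made-up data).
early_round = [["PER", "ORG", "LOC"], ["PER", "PER", "ORG"]]
later_round = [["PER", "PER", "PER"], ["PER", "PER", "ORG"]]

# One disagreement value per round; no gold labels are needed to compute it.
disagreement_curve = [mean_disagreement(r) for r in (early_round, later_round)]
```

Under this reading, a disagreement curve that flattens out would suggest that further annotation yields little additional gain, which is the kind of signal a stopping decision could use.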

Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Michael Poprat | Katrin Tomanek | Joachim Wermter
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is placed on semantic annotations covering a large number of biomedical named entities (almost 100 entity types), semantic relations, and discourse phenomena, reference relations in particular.

Multi-Task Active Learning for Linguistic Annotations
Roi Reichart | Katrin Tomanek | Udo Hahn | Ari Rappoport
Proceedings of ACL-08: HLT

2007

An Approach to Text Corpus Construction which Cuts Annotation Costs and Maintains Reusability of Annotated Data
Katrin Tomanek | Joachim Wermter | Udo Hahn
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Efficient Annotation with the Jena ANnotation Environment (JANE)
Katrin Tomanek | Joachim Wermter | Udo Hahn
Proceedings of the Linguistic Annotation Workshop

An Annotation Type System for a Data-Driven NLP Pipeline
Udo Hahn | Ekaterina Buyko | Katrin Tomanek | Scott Piao | John McNaught | Yoshimasa Tsuruoka | Sophia Ananiadou
Proceedings of the Linguistic Annotation Workshop