Katrin Tomanek

2023

Using pre-trained language models to implement classifiers from small to modest amounts of training data is an area of active research. The ability of large language models to generalize from few-shot examples and to produce strong classifiers is extended using the engineering approach of parameter-efficient tuning. Using the Explainable Detection of Online Sexism (EDOS) training data and a small number of trainable weights to create a tuned prompt vector, a competitive model for this task was built, which was top-ranked in Subtask B.

2022

pdf abs
Context-Aware Abbreviation Expansion Using Large Language Models
Shanqing Cai | Subhashini Venugopalan | Katrin Tomanek | Ajit Narayanan | Meredith Morris | Michael Brenner
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Motivated by the need for accelerating text entry in augmentative and alternative communication (AAC) for people with severe motor impairments, we propose a paradigm in which phrases are abbreviated aggressively as primarily word-initial letters. Our approach is to expand the abbreviations into full-phrase options by leveraging conversation context with the power of pretrained large language models (LLMs). Through zero-shot, few-shot, and fine-tuning experiments on four public conversation datasets, we show that for replies to the initial turn of a dialog, an LLM with 64B parameters is able to exactly expand over 70% of phrases with abbreviation length up to 10, leading to an effective keystroke saving rate of up to about 77% on these exact expansions. Including a small amount of context in the form of a single conversation turn more than doubles abbreviation expansion accuracies compared to having no context, an effect that is more pronounced for longer phrases. Additionally, the robustness of models against typo noise can be enhanced through fine-tuning on noisy data.

2021

pdf abs
Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech
Katrin Tomanek | Vicky Zayats | Dirk Padfield | Kara Vaillancourt | Fadi Biadsy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapter, we can achieve similar adaptation gains compared to model fine-tuning, while only updating a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.

We describe the re-annotation of selected types of named entities (persons, organizations, locations) from the Muc7 corpus. The focus of this annotation initiative is on recording the time needed for the linguistic process of named entity annotation. Annotation times are measured on two basic annotation units -- sentences vs. complex noun phrases. We gathered evidence that decision times are non-uniformly distributed over the annotation units, while they do not substantially deviate among annotators. This data seems to support the hypothesis that annotation times very much depend on the inherent ""hardness"" of each single annotation decision. We further show how such time-stamped information can be used for empirically grounded studies of selective sampling techniques, such as Active Learning. We directly compare Active Learning costs on the basis of token-based vs. time-based measurements. The data reveals that Active Learning keeps its competitive advantage over random sampling in both scenarios though the difference is less marked for the time metric than for the token metric.

The production of gold standard corpora is time-consuming and costly. We propose an alternative: the âsilver standard corpus (SSC), a corpus that has been generated by the harmonisation of the annotations that have been delivered from a selection of annotation systems. The systems have to share the type system for the annotations and the harmonisation solution has use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630.324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand, how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.

2009

pdf
Semi-Supervised Active Learning for Sequence Labeling
Katrin Tomanek | Udo Hahn
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf
An Intrinsic Stopping Criterion for Committee-Based Active Learning
Fredrik Olsson | Katrin Tomanek
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

pdf
How Feasible and Robust is the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and Real-Life Conditions
Udo Hahn | Katrin Tomanek | Ekaterina Buyko | Jung-jae Kim | Dietrich Rebholz-Schuhmann
Proceedings of the BioNLP 2009 Workshop

pdf bib
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
Eric Ringger | Robbie Haertel | Katrin Tomanek
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf bib
On Proper Unit Selection in Active Learning: Co-Selection Effects for Named Entity Recognition
Katrin Tomanek | Florian Laws | Udo Hahn | Hinrich Schütze
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf
A Web Survey on the Use of Active Learning to Support Annotation of Text Data
Katrin Tomanek | Fredrik Olsson
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf
Timed Annotations — Enhancing MUC7 Metadata by the Time It Takes to Annotate Named Entities
Katrin Tomanek | Udo Hahn
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

pdf abs
Approximating Learning Curves for Active-Learning-Driven Annotation
Katrin Tomanek | Udo Hahn
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks. A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts. This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated. While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which - in realistic annotation settings, at least - is often unavailable. We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members. This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain. Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions.

pdf abs
Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Michael Poprat | Katrin Tomanek | Joachim Wermter
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large amount of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference relations in particular.

pdf
Multi-Task Active Learning for Linguistic Annotations
Roi Reichart | Katrin Tomanek | Udo Hahn | Ari Rappoport
Proceedings of ACL-08: HLT