Andreas Niekler


2022

Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers
Christopher Schröder | Andreas Niekler | Martin Potthast
Findings of the Association for Computational Linguistics: ACL 2022

Active learning is the iterative construction of a classification model through targeted labeling, enabling significant labeling cost savings. Because most research on active learning was carried out before transformer-based language models (“transformers”) became popular, comparatively few papers have so far investigated how transformers can be combined with active learning, despite its practical importance. This can be attributed to the fact that using state-of-the-art query strategies for transformers induces a prohibitive runtime overhead, which effectively nullifies, or even outweighs, the desired cost savings. For this reason, we revisit uncertainty-based query strategies, which had been largely outperformed before, but are particularly suited to the context of fine-tuning transformers. In an extensive evaluation, we connect transformers to experiments from previous research, assessing their performance on five widely used text classification benchmarks. For active learning with transformers, several other uncertainty-based approaches outperform the well-known prediction entropy query strategy, thereby challenging its status as the most popular uncertainty baseline in active learning for text classification.

2021

Supporting Land Reuse of Former Open Pit Mining Sites using Text Classification and Active Learning
Christopher Schröder | Kim Bürgl | Yves Annanias | Andreas Niekler | Lydia Müller | Daniel Wiegreffe | Christian Bender | Christoph Mengs | Gerik Scheuermann | Gerhard Heyer
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Open pit mines have left many regions worldwide inhospitable or uninhabitable. Many sites are left behind in a hazardous or contaminated state, show remnants of waste, or have other restrictions imposed upon them, e.g., for the protection of humans or nature. Such information has to be permanently managed in order to reuse those areas in the future. In this work we present and evaluate an automated workflow for supporting the post-mining management of former lignite open pit mines in the eastern part of Germany, where, prior to any planned land reuse, the aforementioned information has to be acquired to ensure the safety and validity of such an endeavor. Usually, this information is found in expert reports, either in the form of paper documents or, in the best case, as digitized unstructured text, all of it in German. However, due to the size and complexity of these documents, any inquiry is tedious and time-consuming, thereby slowing down or even obstructing the reuse of related areas. Since no training data is available, we employ active learning in order to perform multi-label sentence classification for two categories of restrictions and seven categories of topics. The final system integrates optical character recognition (OCR), active-learning-based text classification, and geographic information system visualization in order to effectively extract, query, and visualize this information for any area of interest. The active learning and text classification results are twofold: whereas the restriction categories were classified reasonably accurately (>0.85 F1), the seven topic-oriented categories seemed to be complex even for human annotators and achieved mediocre evaluation scores (<0.70 F1).

Press Freedom Monitor: Detection of Reported Press and Media Freedom Violations in Twitter and News Articles
Tariq Yousef | Antje Schlaf | Janos Borst | Andreas Niekler | Gerhard Heyer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Freedom of the press and media is of vital importance for democratically organised states and open societies. We introduce the Press Freedom Monitor, a tool that aims to detect reported press and media freedom violations in news articles and tweets. It is used by press and media freedom organisations to support their daily monitoring and to trigger rapid response actions. The Press Freedom Monitor enables the monitoring experts to get a fast overview of recently reported incidents, and it has shown an impressive performance in this regard. This paper presents our work on the tool, starting with the training phase, which comprises defining the topic-related keywords to be used for querying APIs for news and Twitter content and evaluating different machine learning models based on a training dataset specifically created for our use case. Then, we describe the components of the production pipeline, including data gathering, duplicate removal, country mapping, case mapping and the user interface. We also conducted a usability study to evaluate the effectiveness of the user interface, and we describe improvement plans for future work.

2018

ILCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data
Andreas Niekler | Arnim Bleier | Christian Kahmann | Lisa Posch | Gregor Wiedemann | Kenan Erdogan | Gerhard Heyer | Markus Strohmaier
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2014

PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone
Christian Haenig | Andreas Niekler | Carsten Wuensch
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe a publicly available multilingual evaluation corpus for phrase-level Sentiment Analysis that can be used to evaluate real-world applications in an industrial context. This corpus contains data from English and German Internet forums (1000 posts each) focusing on the automotive domain. The major topic of the corpus is connecting and using cellphones to/in cars. The presented corpus contains different types of annotations: objects (e.g. my car, my new cellphone), features (e.g. address book, sound quality) and phrase-level polarities (e.g. the best possible automobile, big problem). Each of the posts has been annotated by at least four different annotators; these annotations are retained in their original form. The reliability of the annotations is evaluated by inter-annotator agreement scores. Besides the corpus data and format, we provide comprehensive corpus statistics. This corpus is one of the first lexical resources focusing on real-world applications that analyze the voice of the customer, which is crucial for various industrial use cases.

2012

Lexical Semantics and Distribution of Suffixes - A Visual Analysis
Christian Rohrdantz | Andreas Niekler | Annette Hautli | Miriam Butt | Daniel A. Keim
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH