Andreas Blombach

2025

pdf bib abs
Narrlangen at SemEval-2025 Task 10: Comparing (mostly) simple multilingual approaches to narrative classification
Andreas Blombach | Bao Minh Doan Dang | Stephanie Evert | Tamara Fuchs | Philipp Heinrich | Olena Kalashnikova | Naveed Unjum
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Our team focused on Subtask 2 (narrative classification) and tried several conceptually straightforward approaches: (1) prompt engineering of LLMs, (2) a zero-shot approach based on sentence similarities, (3) direct classification of fine-grained labels using SetFit, (4) fine-tuning encoder models on fine-grained labels, and (5) hierarchical classification using encoder models with two different classification heads. We list results for all systems on the development set, which show that the best approach was to fine-tune a pre-trained multilingual model, XLM-RoBERTa, with two additional linear layers and a softmax as classification head.

2024

pdf bib abs
Automatic Identification of COVID-19-Related Conspiracy Narratives in German Telegram Channels and Chats
Philipp Heinrich | Andreas Blombach | Bao Minh Doan Dang | Leonardo Zilio | Linda Havenstein | Nathan Dykes | Stephanie Evert | Fabian Schäfer
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We are concerned with mapping the discursive landscape of conspiracy narratives surrounding the COVID-19 pandemic. In the present study, we analyse a corpus of more than 1,000 German Telegram posts tagged with 14 fine-grained conspiracy narrative labels by three independent annotators. Since emerging narratives on social media are short-lived and notoriously hard to track, we experiment with different state-of-the-art approaches to few-shot and zero-shot text classification. We report performance in terms of ROC-AUC and in terms of optimal F1, and compare fine-tuned methods with off-the-shelf approaches and human performance.

2020

pdf bib abs
EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus
Thomas Proisl | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Andreas Blombach | Stefan Evert
Proceedings of the Twelfth Language Resources and Evaluation Conference

The EmpiriST corpus (Beißwenger et al., 2016) is a manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC (computer-mediated communication) data. We extend the corpus with manually created annotation layers for word form normalization, lemmatization and lexical semantics. All annotations have been independently performed by multiple human annotators. We report inter-annotator agreements and results of baseline systems and state-of-the-art off-the-shelf tools.

pdf bib abs
A Corpus of German Reddit Exchanges (GeRedE)
Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Thomas Proisl
Proceedings of the Twelfth Language Resources and Evaluation Conference

GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.