Christian Wolff

2026

nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM’s PagedAttention mechanism for efficient key–value cache reuse. Evaluation across 6 languages and 8 language–domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
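
The majority-consensus step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_consensus`, the representation of each prediction as a hashable tuple of strings, and the strict "more than half of the runs" threshold are all assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(runs):
    """Return the tuples that appear in more than half of the runs.

    `runs` is a list with one entry per model execution; each entry is a
    list of hashable tuples (hypothetical format, e.g. (aspect, opinion)).
    """
    counts = Counter()
    for run in runs:
        # Count each tuple at most once per run, even if the model repeated it.
        counts.update(set(run))
    n_runs = len(runs)
    # Strict majority; sorted so the output order is deterministic.
    return sorted(t for t, c in counts.items() if c > n_runs / 2)


if __name__ == "__main__":
    runs = [
        [("pizza", "great")],
        [("pizza", "great"), ("service", "slow")],
        [("pizza", "great")],
    ]
    print(majority_consensus(runs))
```

In this toy input, the tuple predicted in all three runs is kept, while the one predicted in a single run is discarded.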

Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Models
Niklas Donhauser | Jakob Fehle | Nils Constantin Hellwig | Markus Weinberger | Udo Kruschwitz | Christian Wolff
Proceedings of the Fourth Workshop on the Role of Resources in the Age of Large Language Models (RESOURCEFUL 2026)

Aspect-Based Sentiment Analysis (ABSA) enables fine-grained opinion analysis by identifying sentiments toward specific aspects or targets within a text. While ABSA has been widely studied for English, research on other languages such as German remains limited, largely due to the lack of high-quality annotated datasets. This paper examines how different annotation sources influence the development of German ABSA. To this end, an existing dataset is re-annotated by experts to establish a ground truth, which serves as a reference for evaluating annotations produced by students, crowdworkers, Large Language Models (LLMs), and experts. Annotation quality is compared using Inter-Annotator Agreement (IAA) and its impact on downstream model performance for different ABSA subtasks. The evaluation focuses on Aspect Category Sentiment Analysis (ACSA) and Target Aspect Sentiment Detection (TASD). We apply State-of-the-Art (SOTA) methods for ABSA, including BERT-, T5-, and LLaMA-based approaches, to assess performance differences, spanning fine-tuning and in-context learning with instruction prompts. The findings provide practical insights into trade-offs between annotation reliability and efficiency, offering guidance for dataset construction in under-resourced Natural Language Processing (NLP) scenarios.

Mobilize, Inform, Interact: Classifying Political Calls-to-Action Types on Instagram
Michael Achmann-Denkler | Clara Helmig | Jakob Fehle | Mario Haim | Christian Wolff
Proceedings of the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026)

Calls-to-action (CTAs) are central to digital campaigning, yet computational research has largely focused on binary detection only. We address CTA type classification in German Instagram campaign texts (posts and ephemeral stories), distinguishing Support, Inform, Interact, and No CTA. With limited annotated data, we benchmark a fine-tuned GBERT model against GPT models using zero-shot, few-shot, and retrieval-augmented few-shot prompting in a multi-label setup. Both approaches reach similar performance in five-fold cross-validation (macro-F1 ca. 0.79), with persistent difficulty on the rare Interact category. As a proof of concept, we apply the selected setup to the 2021 federal election corpus and show that parties varied not only in overall CTA use but also in how they balanced appeals across posts versus stories. The results demonstrate the feasibility of CTA type classification with modest data and position retrieval-augmented prompting as a practical alternative to supervised fine-tuning.

Posts Talk Policy, Stories Don’t: Policy-Issue Detection on Instagram with Fine-Tuned Transformers and Prompted LLMs
Michael Achmann-Denkler | Mario Haim | Christian Wolff
Proceedings of the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026)

Policy issues are central to election campaigns, yet systematic analyses of issue communication on Instagram remain scarce — particularly for ephemeral Stories. We develop and evaluate automated methods for detecting the binary presence of policy issues in Instagram posts and Stories from the 2021 German federal election. Drawing on a gold-standard dataset of 1,357 annotated documents across three textual channels (captions, OCR-extracted image text, and speech transcripts), we compare a fine-tuned German transformer (GBERT) with multiple LLM prompting strategies (zero-shot, few-shot, retrieval-augmented). Both approaches prove effective: GBERT achieves a cross-validated macro F1 of 0.90, closely matched by GPT-o3 under few-shot prompting (0.88). Substantively, policy visibility varies far more by content format than by party: 70% of posts contain policy references compared to only 17% of Stories, a pattern that holds consistently across all eight parties. An exploratory topic model confirms that parties reproduce familiar issue-ownership profiles within the subset of policy-relevant texts. Our results establish binary issue detection as a feasible foundation for studying policy communication in multimodal, ephemeral social media environments.

Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis
Jakob Fehle | Nils Constantin Hellwig | Udo Kruschwitz | Christian Wolff
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.

AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
Proceedings of the Fifteenth Language Resources and Evaluation Conference

We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
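
The retrieval step described in this abstract (selecting the ten most similar annotated examples as few-shot demonstrations) can be sketched as follows. This is an illustration only: the abstract does not say which similarity measure the tool uses, so token-level Jaccard overlap stands in for it here, and the helper names `jaccard`, `top_k_examples`, and `build_prompt`, as well as the prompt layout, are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity (an assumed stand-in for the real retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def top_k_examples(query, annotated, k=10):
    """Return the k annotated examples most similar to the query text."""
    ranked = sorted(annotated, key=lambda ex: jaccard(query, ex["text"]), reverse=True)
    return ranked[:k]


def build_prompt(query, annotated, k=10):
    """Place the retrieved examples before the query as few-shot demonstrations."""
    shots = top_k_examples(query, annotated, k)
    blocks = [f"Text: {ex['text']}\nLabels: {ex['labels']}" for ex in shots]
    blocks.append(f"Text: {query}\nLabels:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    annotated = [
        {"text": "the pizza was great", "labels": "[(pizza, positive)]"},
        {"text": "slow service", "labels": "[(service, negative)]"},
        {"text": "great pizza here", "labels": "[(pizza, positive)]"},
    ]
    print(build_prompt("pizza was great", annotated, k=2))
```

Because the pool of annotated examples grows during annotation, a retriever of this kind has more relevant demonstrations to choose from over time, which is the effect the abstract describes.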

LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.

2025

Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores almost up to par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.

German Aspect-based Sentiment Analysis in the Wild: B2B Dataset Creation and Cross-Domain Evaluation
Jakob Fehle | Niklas Donhauser | Udo Kruschwitz | Nils Constantin Hellwig | Christian Wolff
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

2024

GERestaurant: A German Dataset of Annotated Restaurant Reviews for Aspect-Based Sentiment Analysis
Nils Constantin Hellwig | Jakob Fehle | Markus Bink | Christian Wolff
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

Detecting Calls to Action in Multimodal Content: Analysis of the 2021 German Federal Election Campaign on Instagram
Michael Achmann-Denkler | Jakob Fehle | Mario Haim | Christian Wolff
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

This study investigates the automated classification of Calls to Action (CTAs) within the 2021 German Instagram election campaign to advance the understanding of mobilization in social media contexts. We analyzed over 2,208 Instagram stories and 712 posts using fine-tuned BERT models and OpenAI’s GPT-4 models. The fine-tuned BERT model incorporating synthetic training data achieved a macro F1 score of 0.93, demonstrating a robust classification performance. Our analysis revealed that 49.58% of Instagram posts and 10.64% of stories contained CTAs, highlighting significant differences in mobilization strategies between these content types. Additionally, we found that FDP and the Greens had the highest prevalence of CTAs in posts, whereas CDU and CSU led in story CTAs.

Divergent Discourses: A Comparative Examination of Blackout Tuesday and #BlackLivesMatter on Instagram
Aenne Knierim | Michael Achmann | Ulrich Heid | Christian Wolff
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

On May 25th, 2020, a viral eleven-minute clip showing the murder of George Floyd sparked international outrage and solidarity, leading to the digital memorial event Blackout Tuesday on Instagram. We analyzed posts to compare Blackout Tuesday discourse with #blacklivesmatter movement conversations. Using topic modeling, we identified dominant themes and counter-narratives in Blackout Tuesday and #blacklivesmatter captions. Using hashtag co-occurrence analysis, we investigate hashtag networks to situate the discourses within spheres of Instagram activism. Our findings indicate that both corpora share themes like “calls to action”, but Blackout Tuesday posts are shorter and solidarity-focused, while #blacklivesmatter posts are longer and address white privilege more explicitly. #blacklivesmatter is linked to anti-racist activism hashtags, while Blackout Tuesday connects more with popular culture and #Alllivesmatter. This supports qualitative research on Blackout Tuesday’s performative allyship, adding a quantitative perspective to the field.

We present results of a project on emotion classification on historical German plays of Enlightenment, Storm and Stress, and German Classicism. We have developed a hierarchical annotation scheme consisting of 13 sub-emotions like suffering, love and joy that sum up to 6 main and 2 polarity classes (positive/negative). We have conducted textual annotations on 11 German plays and have acquired over 13,000 emotion annotations by two annotators per play. We have evaluated multiple traditional machine learning approaches as well as transformer-based models pretrained on historical and contemporary language for a single-label text sequence emotion classification for the different emotion categories. The evaluation is carried out on three different instances of the corpus: (1) taking all annotations, (2) filtering overlapping annotations by annotators, (3) applying a heuristic for speech-based analysis. Best results are achieved on the filtered corpus with the best models being large transformer-based models pretrained on contemporary German language. For the polarity classification accuracies of up to 90% are achieved. The accuracies become lower for settings with a higher number of classes, achieving 66% for 13 sub-emotions. Further pretraining of a historical model with a corpus of dramatic texts led to no improvements.

Lexicon-based Sentiment Analysis in German: Systematic Evaluation of Resources and Preprocessing Techniques
Jakob Fehle | Thomas Schmidt | Christian Wolff
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2016

Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing
Manuel Burghardt | Daniel Granvogl | Christian Wolff
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Data acquisition in dialectology is typically a tedious task, as dialect samples of spoken language have to be collected via questionnaires or interviews. In this article, we suggest using the “web as a corpus” approach for dialectology. We present a case study that demonstrates how authentic language data for the Bavarian dialect (ISO 639-3:bar) can be collected automatically from the social network Facebook. We also show that Facebook can be used effectively as a crowdsourcing platform, where users are willing to translate dialect words collaboratively in order to create a common lexicon of their Bavarian dialect. Key insights from the case study are summarized as “lessons learned”, together with suggestions for future enhancements of the lexicon creation approach.