Nils Constantin Hellwig

2026

nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM’s PagedAttention mechanism for efficient key–value cache reuse. Evaluation across 6 languages and 8 language–domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
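
The consensus step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the strict more-than-half threshold, and the representation of each prediction as a hashable tuple such as (aspect, opinion) are assumptions made for illustration, and the abstract does not say how continuous sentiment scores are merged across runs.

```python
from collections import Counter


def majority_vote(runs, threshold=0.5):
    """Return the tuples predicted in more than `threshold` of the runs.

    `runs` is a list of predictions, one per model execution; each
    prediction is an iterable of hashable tuples (assumed format).
    """
    counts = Counter()
    for run in runs:
        # Deduplicate so a tuple counts at most once per run.
        counts.update(set(run))
    min_votes = threshold * len(runs)
    return {t for t, votes in counts.items() if votes > min_votes}


# Hypothetical outputs of three executions on one review sentence.
runs = [
    [("pizza", "great")],
    [("pizza", "great"), ("service", "slow")],
    [("pizza", "great")],
]
consensus = majority_vote(runs)
```

Here `("pizza", "great")` appears in all three runs and is kept, while `("service", "slow")` appears in only one and is dropped.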

schmerle at SemEval-2026 Task 4: Exploring Large Language Model Prompting Strategies for Low-Resource Narrative Similarity Detection
Maximilian Schmerle | Nils Constantin Hellwig
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Narrative similarity detection has broad applications in plagiarism detection, content recommendation, and comparative narrative analysis. We present a training-free, prompting-only framework for SemEval-2026 Task 4 (Track A), which requires identifying which of two candidate stories is narratively more similar to a given anchor story. Without any fine-tuning or additional annotations, we systematically evaluate three prompt templates across five structural prompting strategies, including zero-shot and few-shot inference, narrative summarization, keyword extraction, aspect splitting, and pairwise comparison. Structured prompt templates and decomposed pairwise comparisons consistently outperform baseline configurations, achieving a peak accuracy of 72.50% on the test set and 67.75% on the final leaderboard (23rd out of 44 teams).

2025

Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores nearly on par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.

German Aspect-based Sentiment Analysis in the Wild: B2B Dataset Creation and Cross-Domain Evaluation
Jakob Fehle | Niklas Donhauser | Udo Kruschwitz | Nils Constantin Hellwig | Christian Wolff
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers