Maximilian Heinrich


2025

Multi-Class versus Means-End: Assessing Classification Approaches for Argument Patterns
Maximilian Heinrich | Khalid Al Khatib | Benno Stein
Proceedings of the 12th Argument Mining Workshop

In the study of argumentation, the schemes introduced by Walton et al. (2008) represent a significant advancement in understanding and analyzing the structure and function of arguments. Walton’s framework is particularly valuable for computational reasoning, as it facilitates the identification of argument patterns and the reconstruction of enthymemes. Despite its practical utility, automatically identifying these schemes remains a challenging problem. To aid human annotators, Visser et al. (2021) developed a decision tree for scheme classification. Building on this foundation, we propose a means-end approach to argument scheme classification that systematically leverages expert knowledge—encoded in a decision tree—to guide language models through a complex classification task. We assess the effectiveness of the means-end approach by conducting a comprehensive comparison with a standard multi-class approach across two datasets, applying both prompting and supervised learning methods to each approach. Our results indicate that the means-end approach, when combined with supervised learning, achieves scores only slightly lower than those of the multi-class classification approach. At the same time, the means-end approach enhances explainability by identifying the specific steps in the decision tree that pose the greatest challenges for each scheme—offering valuable insights for refining the overall means-end classification process.
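
As a rough illustration of how such a decision tree can steer a model through the classification, consider the following Python sketch. The tree contents, the questions, and the keyword-based ask_llm stub are hypothetical stand-ins; they reproduce neither the decision tree of Visser et al. (2021) nor the authors’ prompts.

```python
# Hypothetical sketch: walk a yes/no decision tree, asking a language
# model one question per inner node, until a leaf names a scheme.
# Tree content and the ask_llm stub are illustrative only.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    question: Optional[str] = None  # inner node: yes/no question for the model
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    scheme: Optional[str] = None    # leaf: predicted argument scheme


def ask_llm(argument: str, question: str) -> bool:
    """Stand-in for a prompted LLM call returning a yes/no answer.
    A trivial keyword heuristic keeps the sketch self-contained."""
    keyword = "expert" if "expert" in question.lower() else "example"
    return keyword in argument.lower()


def classify(argument: str, node: Node, trace: list) -> str:
    if node.scheme is not None:  # reached a leaf
        return node.scheme
    answer = ask_llm(argument, node.question)
    trace.append(f"{node.question} -> {'yes' if answer else 'no'}")
    return classify(argument, node.yes if answer else node.no, trace)


# Illustrative two-level tree (not Visser et al.'s actual tree):
tree = Node(
    question="Does the argument appeal to an expert's opinion?",
    yes=Node(scheme="Argument from Expert Opinion"),
    no=Node(
        question="Does the argument generalize from an example?",
        yes=Node(scheme="Argument from Example"),
        no=Node(scheme="Other"),
    ),
)

trace = []
print(classify("Dr. Lee, an expert on reactors, says it is safe.", tree, trace))
print(trace)  # the decision path taken through the tree
```

In the supervised variant described above, the prompted yes/no decision at each node would be replaced by a fine-tuned binary classifier; the recorded trace is what makes the per-step error analysis possible.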

Webis at CQs-Gen 2025: Prompting and Reranking for Critical Questions
Midhun Kanadan | Johannes Kiesel | Maximilian Heinrich | Benno Stein
Proceedings of the 12th Argument Mining Workshop

This paper reports on the submission of team Webis to the Critical Question Generation shared task at the 12th Workshop on Argument Mining (ArgMining 2025). Our approach is a fully automated two-stage pipeline that first prompts a large language model (LLM) to generate candidate critical questions for a given argumentative intervention, and then reranks the generated questions according to a classifier’s confidence in their usefulness. For the generation stage, we tested zero-shot, few-shot, and chain-of-thought prompting strategies. For the reranking stage, we used a ModernBERT classifier that we fine-tuned on either the validation set or an augmented version of it. Among our submissions, the best-performing configuration achieved a test score of 0.57 and ranked 5th in the shared task. Submissions that use reranking consistently outperformed baseline submissions without reranking across all metrics. Our results demonstrate that combining open-weight LLMs with reranking significantly improves the quality of the resulting critical questions.
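
A minimal sketch of such a generate-then-rerank pipeline, assuming the Hugging Face transformers API: the prompt, the generator wrapper, and the binary usefulness classifier are illustrative assumptions, not the team’s actual prompts or fine-tuned checkpoint.

```python
# Hypothetical sketch of a generate-then-rerank pipeline for critical
# questions: an LLM proposes candidates, a fine-tuned classifier scores
# their usefulness, and candidates are sorted by that confidence.
# Prompt and model names are illustrative, not the team's setup.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def generate_candidates(intervention: str, llm_generate, n: int = 10) -> list:
    """llm_generate is any callable wrapping an LLM API; switching to
    few-shot or chain-of-thought prompting only changes this prompt."""
    prompt = (
        "Read the following argumentative text and write one critical "
        f"question that challenges its reasoning.\n\nText: {intervention}"
    )
    return [llm_generate(prompt) for _ in range(n)]


def rerank(intervention: str, questions, model_name: str):
    """Score each (intervention, question) pair with a binary usefulness
    classifier and sort by the positive-class confidence."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    scored = []
    with torch.no_grad():
        for q in questions:
            inputs = tokenizer(intervention, q, truncation=True, return_tensors="pt")
            probs = model(**inputs).logits.softmax(dim=-1)
            scored.append((q, probs[0, 1].item()))  # P(useful)
    return sorted(scored, key=lambda x: x[1], reverse=True)
```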

2024

The Touché23-ValueEval Dataset for Identifying Human Values behind Arguments
Nailia Mirzakhmedova | Johannes Kiesel | Milad Alshomary | Maximilian Heinrich | Nicolas Handke | Xiaoni Cai | Valentin Barriere | Doratossadat Dastgheib | Omid Ghahroodi | MohammadAli SadraeiJavaheri | Ehsaneddin Asgari | Lea Kawaletz | Henning Wachsmuth | Benno Stein
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

While human values play a crucial role in making arguments persuasive, we currently lack the extensive datasets needed to develop methods for analyzing the values underlying these arguments on a large scale. To address this gap, we present the Touché23-ValueEval dataset, an expansion of the Webis-ArgValues-22 dataset. We collected and annotated an additional 4780 new arguments, doubling the dataset’s size to 9324 arguments. The arguments were drawn from six diverse sources, including religious texts, community discussions, free-text arguments, newspaper editorials, and political debates. Each argument is annotated by three crowdworkers for 54 human values, following the methodology established in the original dataset. The Touché23-ValueEval dataset was utilized in SemEval-2023 Task 4 (ValueEval: Identification of Human Values behind Arguments), where an ensemble of transformer models demonstrated state-of-the-art performance. Furthermore, our experiments show that a fine-tuned large language model, Llama-2-7B, achieves comparable results.

2023

SemEval-2023 Task 4: ValueEval: Identification of Human Values Behind Arguments
Johannes Kiesel | Milad Alshomary | Nailia Mirzakhmedova | Maximilian Heinrich | Nicolas Handke | Henning Wachsmuth | Benno Stein
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Argumentation is ubiquitous in natural language communication, from politics and media to everyday work and private life. Many arguments derive their persuasive power from human values, such as self-directed thought or tolerance, albeit often implicitly. These values are key to understanding the semantics of arguments, as they are generally accepted as justifications for why a particular option is ethically desirable. Can automated systems uncover the values on which an argument draws? To answer this question, 39 teams submitted runs to ValueEval’23. Using a multi-sourced dataset of over 9K arguments, the systems achieved F1-scores up to 0.87 (nature) and over 0.70 for three more of 20 universal value categories. However, many challenges remain, as evidenced by the low peak F1-score of 0.39 for stimulation, hedonism, face, and humility.
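
Computationally, the task amounts to multi-label classification over the value categories, since an argument may draw on several values at once. The following sketch shows one plausible setup; the base model and threshold are illustrative assumptions, not the configuration of any participating team.

```python
# Hypothetical sketch: human value identification as multi-label
# classification, one sigmoid output per value category. The base
# model and the decision threshold are illustrative only.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_VALUES = 20  # universal value categories in ValueEval'23

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_VALUES,
    problem_type="multi_label_classification",  # sigmoid per label
)


def predict_values(argument: str, threshold: float = 0.5) -> list:
    """Return the indices of value categories whose sigmoid score
    exceeds the threshold; an argument may draw on several values."""
    inputs = tokenizer(argument, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return [i for i, p in enumerate(probs) if p > threshold]
```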

2022

Few-Shot Learning for Argument Aspects of the Nuclear Energy Debate
Lena Jurkschat | Gregor Wiedemann | Maximilian Heinrich | Mattes Ruckdeschel | Sunna Torge
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We approach aspect-based argument mining as a supervised machine learning task to classify arguments into semantically coherent groups referring to the same defined aspect categories. As an exemplary use case, we introduce the Argument Aspect Corpus - Nuclear Energy that separates arguments about the topic of nuclear energy into nine major aspects. Since the collection of training data for further aspects and topics is costly, we investigate the potential for current transformer-based few-shot learning approaches to accurately classify argument aspects. The best approach is applied to a British newspaper corpus covering the debate on nuclear energy over the past 21 years. Our evaluation shows that a stable prediction of shares of argument aspects in this debate is feasible with 50 to 100 training samples per aspect. Moreover, we observe signals of a clear shift in the public discourse in favor of nuclear energy in recent years. This revelation of changing patterns of pro and contra arguments related to certain aspects over time demonstrates the potential of supervised argument aspect detection for tracking issue-specific media discourses.
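
As a rough sketch of the general few-shot setup (not the paper’s exact models or hyperparameters): embed arguments with a pretrained sentence encoder and fit a lightweight classifier on the 50 to 100 labeled examples per aspect.

```python
# Hypothetical sketch: few-shot argument aspect classification via a
# frozen sentence encoder plus a lightweight classifier trained on
# 50-100 labeled examples per aspect. Encoder choice is illustrative.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder


def train_aspect_classifier(texts, labels):
    """texts: argument sentences; labels: one of the nine aspect
    categories, with roughly 50-100 training samples per aspect."""
    return LogisticRegression(max_iter=1000).fit(encoder.encode(texts), labels)


def predict_aspect(clf, text: str) -> str:
    return clf.predict(encoder.encode([text]))[0]
```

With so few samples, the frozen encoder does most of the work; swapping in a stronger encoder or a prompt-based few-shot method changes only the embedding step.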