Proceedings of the 12th Argument Mining Workshop

Elena Chistova, Philipp Cimiano, Shohreh Haddadan, Gabriella Lapesa, Ramon Ruiz-Dolz (Editors)


Anthology ID:
2025.argmining-1
Month:
July
Year:
2025
Address:
Vienna, Austria
Venues:
ArgMining | WS
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.argmining-1/
ISBN:
979-8-89176-258-9
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.argmining-1.pdf

Proceedings of the 12th Argument Mining Workshop
Elena Chistova | Philipp Cimiano | Shohreh Haddadan | Gabriella Lapesa | Ramon Ruiz-Dolz

“The Facts Speak for Themselves”: GPT and Fallacy Classification
Erisa Bytyqi | Annette Hautli-Janisz

Fallacies are not only part and parcel of human communication; they are also important for generative models, in that fallacies can be tailored to self-verify the output these models generate. Previous work has shown that fallacy detection and classification are tricky, but the question that still remains is whether the use of theoretical explanations in prompting Large Language Models (LLMs) on the task enhances the performance of the models. In this paper we show that this is not the case: using the pragma-dialectics (PD) approach to fallacies (van Eemeren, 1987), we show that three GPT models struggle with the task. Based on our own PD-oriented dataset of fallacies and an extension of an existing fallacy dataset from Jin et al. (2022), we show that this holds not only for fallacies “in the wild”, but also for textbook examples of fallacious arguments. Our paper also supports the claim that LLMs generally lag behind smaller-scale neural models in fallacy classification.

Exploring LLM Priming Strategies for Few-Shot Stance Classification
Yamen Ajjour | Henning Wachsmuth

Large language models (LLMs) are effective in predicting the labels of unseen target instances when instructed about the task and given training instances via the prompt. LLMs generate text with higher probability if the prompt contains text with similar characteristics, a phenomenon called priming that especially affects argumentation. An open question in NLP is how to systematically exploit priming to choose a set of instances suitable for a given task. For stance classification, LLMs may be primed with few-shot instances prior to identifying whether a given argument is pro or con a topic. In this paper, we explore two priming strategies for few-shot stance classification: one takes those instances that are most semantically similar, and the other chooses those that are most stance-similar. Experiments on three common stance datasets suggest that priming an LLM with stance-similar instances is particularly effective for few-shot stance classification compared to baseline strategies, and behaves largely consistently across different LLM variants.
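To make the selection step concrete, here is a minimal sketch of similarity-based few-shot instance selection, assuming a sentence-transformer encoder and toy data. It illustrates only the semantic-similarity strategy (the stance-similar variant would additionally require stance labels or predictions for ranking) and is not the authors' implementation.

```python
# Hypothetical sketch: choosing few-shot examples for stance classification
# by semantic similarity. Encoder name and data are illustrative assumptions,
# not the authors' setup.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def top_k_semantic(pool, query, k=4):
    """Return the k pool instances whose text is most similar to the query."""
    texts = [ex["text"] for ex in pool]
    emb = encoder.encode(texts + [query], normalize_embeddings=True)
    sims = emb[:-1] @ emb[-1]              # cosine similarity to the query
    order = np.argsort(-sims)[:k]
    return [pool[i] for i in order]

def build_prompt(shots, query, topic):
    """Assemble a few-shot prompt from the selected priming instances."""
    lines = [f'Argument: "{ex["text"]}"\nStance on {ex["topic"]}: {ex["stance"]}'
             for ex in shots]
    lines.append(f'Argument: "{query}"\nStance on {topic}:')
    return "\n\n".join(lines)

pool = [
    {"text": "Nuclear power emits almost no CO2.", "topic": "nuclear energy", "stance": "pro"},
    {"text": "Reactor accidents have catastrophic costs.", "topic": "nuclear energy", "stance": "con"},
]
query = "Renewables alone cannot cover baseload demand."
print(build_prompt(top_k_semantic(pool, query, k=2), query, "nuclear energy"))
```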

Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design
Elena Musi | Nadin Kökciyan | Khalid Al Khatib | Davide Ceolin | Emmanuelle Dietz | Klara Maximiliane Gutekunst | Annette Hautli-Janisz | Cristián Santibáñez | Jodi Schneider | Jonas Scholz | Cor Steging | Jacky Visser | Henning Wachsmuth

In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking skills rather than replacing them. We introduce the concept of reasonable parrots that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

Retrieving Argument Graphs Using Vision Transformers
Kilian Bartz | Mirko Lenz | Ralph Bergmann

Through manual annotation or automated argument mining processes, arguments can be represented not only as text, but also in structured formats like graphs. When searching for relevant arguments, this additional information about the relationship between their elementary units allows for the formulation of fine-grained structural constraints by using graphs as queries. Then, a retrieval can be performed by computing the similarity between the query and all available arguments. Previous works employed Graph Edit Distance (GED) algorithms such as A* search to compute mappings between nodes and edges for determining the similarity, which is rather expensive. In this paper, we propose an alternative based on Vision Transformers where arguments are rendered as images to obtain dense embeddings. We propose multiple space-filling visualizations and evaluate the retrieval performance of the vision-based approach against an existing A* search-based method. We find that our technique runs orders of magnitude faster than A* search and scales well on larger argument graphs while achieving competitive results.
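As a rough illustration of the retrieval idea (not the authors' pipeline), the sketch below renders a graph with a plain node-link layout in place of the paper's space-filling visualizations, embeds the image with an assumed off-the-shelf ViT checkpoint, and ranks stored graphs by cosine similarity.

```python
# Illustrative sketch only: render an argument graph to an image, embed it with
# a ViT, and rank stored graphs by cosine similarity to the query graph.
import io
import networkx as nx
import matplotlib.pyplot as plt
from PIL import Image
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def render(graph: nx.DiGraph) -> Image.Image:
    """Node-link rendering (stand-in for the paper's space-filling layouts)."""
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)   # 224x224 pixels
    nx.draw(graph, ax=ax, node_size=300, arrows=True)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

@torch.no_grad()
def embed(graph: nx.DiGraph) -> torch.Tensor:
    inputs = processor(images=render(graph), return_tensors="pt")
    return vit(**inputs).last_hidden_state[:, 0].squeeze(0)  # [CLS] embedding

def rank(query: nx.DiGraph, corpus: list[nx.DiGraph]) -> list[int]:
    """Indices of corpus graphs sorted by similarity to the query graph."""
    q = embed(query)
    sims = [torch.cosine_similarity(q, embed(g), dim=0).item() for g in corpus]
    return sorted(range(len(corpus)), key=lambda i: -sims[i])
```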

Old but Gold: LLM-Based Features and Shallow Learning Methods for Fine-Grained Controversy Analysis in YouTube Comments
Davide Bassi | Erik Bran Marino | Renata Vieira | Martin Pereira

Online discussions can either bridge differences through constructive dialogue or amplify divisions through destructive interactions. This paper proposes a computational approach to analyze dialogical relation patterns in YouTube comments, offering a fine-grained framework for controversy detection that also enables analysis of individual contributions. Our experiments demonstrate that shallow learning methods, when equipped with these theoretically grounded features, consistently outperform more complex language models in characterizing discourse quality at both comment-pair and conversation-chain levels. Our studies confirm that divisive rhetorical techniques serve as strong predictors of destructive communication patterns. This work advances understanding of how communicative choices shape online discourse, moving beyond engagement metrics toward a nuanced examination of constructive versus destructive dialogue patterns.

Multi-Agent LLM Debate Unveils the Premise Left Unsaid
Harvey Bonmu Ku | Jeongyeol Shin | Hyoun Jun Lee | Seonok Na | Insu Jeon

Implicit premises are central to argumentative coherence and faithfulness, yet they remain elusive for traditional single-pass computational models. We introduce a multi-agent framework that casts implicit premise recovery as a dialogic reasoning task between two LLM agents. Through structured rounds of debate, agents critically evaluate competing premises and converge on the most contextually appropriate interpretation. Evaluated on a controlled binary classification benchmark for premise selection, our approach achieves state-of-the-art accuracy, outperforming both neural baselines and single-agent LLMs. We find that accuracy gains stem not from repeated generation, but from agents refining their predictions in response to opposing views. Moreover, we show that forcing models to defend assigned stances degrades performance, engendering rhetorical rigidity that entrenches flawed reasoning. These results underscore the value of interactive debate in revealing pragmatic components of argument structure.

Leveraging Graph Structural Knowledge to Improve Argument Relation Prediction in Political Debates
Deborah Dore | Stefano Faralli | Serena Villata

Argument Mining (AM) aims at detecting argumentation structures (i.e., premises and claims linked by attack and support relations) in text. A natural application domain is political debates, where uncovering the hidden dynamics of a politician’s argumentation strategies can help the public to identify fallacious and propagandist arguments. Despite the few approaches proposed in the literature to apply AM to political debates, this application scenario remains challenging, particularly for the task of predicting the relation holding between two argument components. Most AM relation prediction approaches consider only the textual content of the argument components to identify and classify the argumentative relation holding between them (i.e., support, attack), and they mostly ignore the structural knowledge that arises from the overall argumentation graph. In this paper, we propose to address the relation prediction task in AM by combining the structural knowledge provided by a Knowledge Graph Embedding model with the contextual knowledge provided by a fine-tuned Language Model. Our experimental setting is grounded in a standard AM benchmark of televised political debates from the US presidential campaigns from 1960 to 2020. Our extensive experiments demonstrate that integrating these two distinct forms of knowledge (i.e., the textual content of the argument components and the structural knowledge of the argumentation graph) outperforms existing approaches on this benchmark and enhances the accuracy of the predictions.
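A schematic sketch of how the two knowledge sources could be combined for relation prediction is given below; the embedding dimensions, model names, and the way the knowledge-graph embeddings are obtained (e.g., from a TransE-style model) are illustrative assumptions, not the authors' exact architecture.

```python
# Schematic sketch (assumed dimensions and names): fuse structural knowledge
# (pre-trained knowledge-graph embeddings of the two argument components) with
# contextual knowledge (a language-model encoding of their text) to predict
# the relation (e.g., support / attack / no relation).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RelationClassifier(nn.Module):
    def __init__(self, kg_dim=200, lm_name="bert-base-uncased", n_labels=3):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)
        hidden = self.lm.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden + 2 * kg_dim, 256), nn.ReLU(), nn.Linear(256, n_labels)
        )

    def forward(self, src_text, tgt_text, src_kg_emb, tgt_kg_emb):
        # Contextual part: encode the component pair jointly with the LM.
        enc = self.tokenizer(src_text, tgt_text, return_tensors="pt",
                             truncation=True, padding=True)
        cls = self.lm(**enc).last_hidden_state[:, 0]           # (batch, hidden)
        # Structural part: graph embeddings of the two components
        # (assumed to come from a pre-trained KGE model such as TransE).
        feats = torch.cat([cls, src_kg_emb, tgt_kg_emb], dim=-1)
        return self.head(feats)                                 # relation logits
```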

On Integrating LLMs Into an Argument Annotation Workflow
Robin Schaefer

Given the recent success of LLMs across different NLP tasks, their usability for data annotation has become a promising area of research. In this work, we investigate to what extent LLMs can be used as annotators for argument components and their semantic types in German tweets through a series of experiments combining different models and prompt configurations. Each prompt is constructed from modular components, such as class definitions or contextual information. Our results suggest that LLMs can indeed perform argument annotation, particularly of semantic argument types, if provided with precise class definitions. However, a fine-tuned BERT baseline remains a strong contender, often matching or exceeding LLM performance. These findings highlight the importance of considering not only model performance, but also ecological and financial costs when defining an annotation workflow.

Practical Solutions to Practical Problems in Developing Argument Mining Systems
Debela Gemechu | Ramon Ruiz-Dolz | John Lawrence | Chris Reed

The Open Argument Mining Framework (oAMF) addresses key challenges in argument mining research which still persist despite the field’s impressive growth. Researchers often face difficulties with cross-system comparisons, incompatible representation languages, and limited access to reusable tools. The oAMF introduces a standardised yet flexible architecture that enables seamless component benchmarking, rapid pipeline prototyping using elements from diverse research traditions, and unified evaluation methodologies that preserve theoretical compatibility. By reducing technical overhead, the framework allows researchers to focus on advancing core argument mining capabilities rather than reimplementing infrastructure, fostering greater collaboration at a time when computational reasoning is increasingly vital in the era of large language models.

Argumentative Analysis of Legal Rulings: A Structured Framework Using Bobbitt’s Typology
Carlotta Giacchetta | Raffaella Bernardi | Barbara Montini | Jacopo Staiano | Serena Tomasi

Legal reasoning remains one of the most complex and nuanced domains for AI, with current tools often lacking transparency and domain adaptability. While recent advances in large language models (LLMs) offer new opportunities for legal analysis, their ability to structure and interpret judicial argumentation remains unexplored. We address this gap by proposing a structured framework for AI-assisted legal reasoning, centered on argumentative analysis. In this work, we use GPT-4o for discourse-level and semantic analysis to identify argumentative units and classify them according to Philippe Bobbitt’s six constitutional modalities of legal reasoning. We apply this framework to legal rulings from the Italian Court of Cassation. Our experimental findings indicate that LLM-based tools can effectively augment and streamline legal practice, e.g. by preprocessing the legal texts under scrutiny; still, the limited performance of the state-of-the-art generative model tested indicates significant room for progress in human-AI collaboration in the legal domain.

Aspect-Based Opinion Summarization with Argumentation Schemes
Wendi Zhou | Ameer Saadat-Yazdi | Nadin Kökciyan

Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually distill the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.

Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging
Carlotta Quensel | Neele Falk | Gabriella Lapesa

In assessing argument strength, the notions of what makes a good argument are manifold. With the broader trend towards treating subjectivity as an asset and not a problem in NLP, new dimensions of argument quality are studied. Although studies on individual subjective features like personal stories exist, there is a lack of large-scale analyses of the relation between these features and argument strength. To address this gap, we conduct a regression analysis to quantify the impact of subjective factors (emotions, storytelling, and hedging) on two standard datasets annotated for objective argument quality and subjective persuasion. As such, our contribution is twofold: at the level of contributed resources, as there are no datasets annotated with all studied dimensions, this work compares and evaluates automated annotation methods for each subjective feature. At the level of novel insights, our regression analysis uncovers different patterns of impact of subjective features on the two facets of argument strength encoded in the datasets. Our results show that storytelling and hedging have contrasting effects on objective and subjective argument quality, while the influence of emotions depends on their rhetorical use rather than the domain.
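The kind of regression analysis described can be sketched as follows, with placeholder column names and toy values standing in for the automatically annotated subjective factors and the annotated argument-strength scores.

```python
# Minimal sketch of a regression analysis over subjective features
# (illustrative columns and data only; not the authors' setup).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "quality":      [0.7, 0.4, 0.9, 0.5, 0.6, 0.8],   # annotated argument strength
    "storytelling": [1, 0, 1, 0, 0, 1],                # contains a personal story?
    "hedging":      [0.2, 0.5, 0.1, 0.4, 0.3, 0.2],    # share of hedged clauses
    "emotion":      [0.6, 0.3, 0.7, 0.2, 0.4, 0.5],    # emotional intensity
})

model = smf.ols("quality ~ storytelling + hedging + emotion", data=df).fit()
print(model.params)    # signed coefficients: direction of each feature's impact
print(model.pvalues)   # which effects are statistically reliable
```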

DebArgVis: An Interactive Visualisation Tool for Exploring Argumentative Dynamics in Debate
Martin Gruber | Zlata Kikteva | Ignaz Rutter | Annette Hautli-Janisz

Television debates play a key role in shaping public opinion; however, the rapid exchange of viewpoints in these settings often makes it difficult to perceive the underlying nature of the discussion. While there exist several debate visualisation techniques, to the best of our knowledge, none of them emphasise the argumentative dynamics in particular. With DebArgVis, we present a new interactive debate visualisation tool that leverages data annotated with argumentation structures to demonstrate how speaker interactions unfold over time, enabling users to deepen their comprehension of the debate.

Automatic Identification and Naming of Overlapping and Topic-specific Argumentation Frames
Carolin Schindler | Annalena Aicher | Niklas Rach | Wolfgang Minker

Being aware of frames, i.e., the aspect-based grouping of arguments, is crucial in applications that build upon a corpus of arguments, allowing, among others, biases and filter bubbles to be mitigated. However, manually identifying and naming these frames can be time-consuming and therefore not feasible for larger datasets. Within this work, we present a sequential three-step pipeline for automating this task in a data-driven manner. After embedding the arguments, we apply clustering algorithms to identify the frames and subsequently utilize methods from the field of cluster labeling to name them. The proposed approach is tailored towards the requirements of practical applications where arguments may not be easily split into their argumentative units and hence can belong to more than one frame. Performing a component-wise evaluation, we determine the best-performing configuration of the pipeline. Our results indicate that frames should be identified with overlapping rather than exclusive clustering, and that frame naming is best accomplished by extracting aspect terms and weighting them with c-TF-IDF.
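For the naming step, a minimal sketch of class-based TF-IDF (c-TF-IDF) weighting is shown below, following the commonly used formulation in which term frequencies per frame are rescaled by how concentrated each term is across frames; the example frames and the exact formulation are illustrative assumptions rather than the authors' configuration.

```python
# Sketch of c-TF-IDF-style frame naming: each frame's arguments are joined into
# one "class document", terms are scored per frame, and the top-weighted terms
# serve as the frame's name. Illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def frame_labels(frames: dict[str, list[str]], top_n=3):
    """Name each frame by its top c-TF-IDF terms."""
    docs = [" ".join(args) for args in frames.values()]       # one class doc per frame
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(docs).toarray().astype(float)
    terms = np.array(vec.get_feature_names_out())
    tf = counts / counts.sum(axis=1, keepdims=True)            # term freq within a frame
    avg_words = counts.sum() / counts.shape[0]                  # average words per frame
    idf = np.log(1 + avg_words / counts.sum(axis=0))            # rarer across frames = higher
    ctfidf = tf * idf
    return {name: terms[np.argsort(-row)[:top_n]].tolist()
            for name, row in zip(frames, ctfidf)}

frames = {
    "cost":   ["tuition is too expensive", "students take on heavy debt"],
    "access": ["free education widens access", "everyone deserves a chance to study"],
}
print(frame_labels(frames, top_n=2))
```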

A Simple but Effective Context Retrieval for Sequential Sentence Classification in Long Legal Documents
Anas Belfathi | Nicolas Hernandez | Monceaux Laura | Richard Dufour

Sequential sentence classification extends traditional sentence classification and is especially useful when dealing with long documents. However, state-of-the-art approaches face two major challenges: pre-trained language models struggle with input-length constraints, while previously proposed hierarchical models often introduce irrelevant content. To address these limitations, we propose a simple and effective document-level retrieval approach that extracts only the most relevant context. Specifically, we introduce two heuristic strategies: Sequential, which captures local information, and Selective, which retrieves semantically similar sentences. Experiments on legal-domain datasets show that both heuristics lead to consistent improvements over the baseline, with an average increase of ∼5.5 weighted-F1 points. The Sequential heuristic outperforms hierarchical models on two out of three datasets, with gains of up to ∼1.5 points, demonstrating the benefits of targeted context.
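A simplified sketch of the two heuristics is given below, with the encoder choice and window size as assumptions: the Sequential variant gathers neighbouring sentences, while the Selective variant retrieves the most semantically similar ones.

```python
# Illustrative sketch of the two context-retrieval heuristics (simplified):
# given a target sentence in a long document, attach either its neighbours
# (Sequential) or the most semantically similar sentences (Selective).
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def sequential_context(sentences, i, window=2):
    """Local context: the sentences immediately around position i."""
    lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
    return [s for j, s in enumerate(sentences[lo:hi], start=lo) if j != i]

def selective_context(sentences, i, k=3):
    """Global context: the k sentences most similar to sentence i."""
    emb = encoder.encode(sentences, normalize_embeddings=True)
    sims = emb @ emb[i]
    sims[i] = -np.inf                       # exclude the target itself
    return [sentences[j] for j in np.argsort(-sims)[:k]]
```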

Stance-aware Definition Generation for Argumentative Texts
Natalia Evgrafova | Loic De Langhe | Els Lefever | Veronique Hoste

Definition generation models trained on dictionary data are generally expected to produce neutral and unbiased output while capturing the contextual nuances. However, previous studies have shown that generated definitions can inherit biases from both the underlying models and the input context. This paper examines the extent to which stance-related bias in argumentative data influences the generated definitions. In particular, we train a model on a slang-based dictionary to explore the feasibility of generating persuasive definitions that concisely reflect opposing parties’ understandings of contested terms. Through this study, we provide new insights into bias propagation in definition generation and its implications for definition generation applications and argument mining.

Reproducing the Argument Quality Prediction of Project Debater
Ines Zelch | Matthias Hagen | Benno Stein | Johannes Kiesel

A crucial task when analyzing arguments is to determine their quality. Especially when you have to choose from a large number of suitable arguments, the determination of a reliable argument quality value is of great benefit. Probably the best-known model for determining such an argument quality value was developed in IBM’s Project Debater and made available to the research community free of charge via an API. In fact, the model was never open and the API is no longer available. In this paper, IBM’s model is reproduced using the freely available training data and the description in the corresponding publication. Our reproduction achieves similar results on the test data as described in the original publication. Further, the predicted quality scores of reproduction and original show a very high correlation (Pearson’s r=0.9) on external data.

Reasoning Under Distress: Mining Claims and Evidence in Mental Health Narratives
Jannis Köckritz | Bahar İlgen | Georges Hattab

This paper explores the application of argument mining to mental health narratives using zero‐shot transfer learning. We fine‐tune a BERT‐based sentence classifier on ~15k essays from the Persuade dataset—achieving 69.1% macro‐F1 on its test set—and apply it without domain adaptation to the CAMS dataset, which consists of anonymized mental health–related Reddit posts. On a manually annotated gold‐standard set of 150 CAMS sentences, our model attains 54.7% accuracy and 48.9% macro‐F1, with evidence detection (F1 = 63.4%) transferring more effectively than claim identification (F1 = 32.0%). Analysis across expert‐annotated causal factors of distress shows that personal narratives heavily favor experiential evidence (65–77% of sentences) compared to academic writing. The prevalence of evidence sentences, many of which appear to be grounded in lived experiences, such as descriptions of emotional states or personal events, suggests that personal narratives favor descriptive recollection over formal, argumentative reasoning. These findings underscore the unique challenges of argument mining in affective contexts and offer recommendations for enhancing argument mining tools within clinical and digital mental health support systems.

Multi-Class versus Means-End: Assessing Classification Approaches for Argument Patterns
Maximilian Heinrich | Khalid Al Khatib | Benno Stein

In the study of argumentation, the schemes introduced by Walton et al. (2008) represent a significant advancement in understanding and analyzing the structure and function of arguments. Walton’s framework is particularly valuable for computational reasoning, as it facilitates the identification of argument patterns and the reconstruction of enthymemes. Despite its practical utility, automatically identifying these schemes remains a challenging problem. To aid human annotators, Visser et al. (2021) developed a decision tree for scheme classification. Building on this foundation, we propose a means-end approach to argument scheme classification that systematically leverages expert knowledge—encoded in a decision tree—to guide language models through a complex classification task. We assess the effectiveness of the means-end approach by conducting a comprehensive comparison with a standard multi-class approach across two datasets, applying both prompting and supervised learning methods to each approach. Our results indicate that the means-end approach, when combined with supervised learning, achieves scores only slightly lower than those of the multi-class classification approach. At the same time, the means-end approach enhances explainability by identifying the specific steps in the decision tree that pose the greatest challenges for each scheme—offering valuable insights for refining the overall means-end classification process.

From Debates to Diplomacy: Argument Mining Across Political Registers
Maria Poiaganova | Manfred Stede

This paper addresses the problem of cross-register generalization in argument mining within political discourse. We examine whether models trained on adversarial, spontaneous U.S. presidential debates can generalize to the more diplomatic and prepared register of UN Security Council (UNSC) speeches. To this end, we conduct a comprehensive evaluation across four core argument mining tasks. Our experiments show that the tasks of detecting and classifying argumentative units transfer well across registers, while identifying and labeling argumentative relations remains notably challenging, likely due to register-specific differences in how argumentative relations are structured and expressed. As part of this work, we introduce ArgUNSC, a new corpus of 144 UNSC speeches manually annotated with claims, premises, and their argumentative links. It provides a resource for future in- and cross-domain studies and novel research directions at the intersection of argument mining and political science.

Storytelling in Argumentative Discussions: Exploring the Use of Narratives in ChangeMyView
Sara Nabhani | Khalid Al Khatib | Federico Pianzola | Malvina Nissim

Psychological research has long suggested that storytelling can shape beliefs and behaviors by fostering emotional engagement and narrative transportation. However, it remains unclear whether these effects extend to online argumentative discourse. In this paper, we examine the role of narrative in real-world argumentation using discussions from the ChangeMyView subreddit. Leveraging an automatic story detection model, we analyze how narrative use varies across persuasive comments, user types, discussion outcomes, and the kinds of change being sought. While narrative appears more frequently in some contexts, it is not consistently linked to successful persuasion. Notably, highly persuasive users tend to use narrative less, and storytelling does not demonstrate increased effectiveness for any specific type of persuasive goals. These findings suggest that narrative may play a limited and context-dependent role in online discussions, highlighting the need for computational models of argumentation to account for rhetorical diversity.

Segmentation of Argumentative Texts by Key Statements for Argument Mining from the Web
Ines Zelch | Matthias Hagen | Benno Stein | Johannes Kiesel

Argument mining is the task of identifying the argument structure of a text: claims, premises, support/attack relations, etc. However, determining the complete argument structure can be quite involved, especially for unpolished texts from online forums, while for many applications the identification of argumentative key statements would suffice (e.g., for argument search). To this end, we introduce and investigate the new task of segmenting an argumentative text by its key statements. We formalize the task, create a first dataset from online communities, propose an evaluation scheme, and conduct a pilot study with several approaches. Interestingly, our experimental results indicate that none of the tested approaches (even LLM-based ones) can actually satisfactorily solve key statement segmentation yet.

Overview of the Critical Questions Generation Shared Task
Blanca Calvo Figueras | Rodrigo Agerri | Maite Heredia | Jaione Bengoetxea | Elena Cabrio | Serena Villata

The proliferation of AI technologies has reinforced the importance of developing critical thinking skills. We propose leveraging Large Language Models (LLMs) to facilitate the generation of critical questions: inquiries designed to identify fallacious or inadequately constructed arguments. This paper presents an overview of the first shared task on Critical Questions Generation (CQs-Gen). Thirteen teams investigated various methodologies for generating questions that critically assess arguments within the provided texts. The highest accuracy achieved was 67.6, indicating substantial room for improvement in this task. Moreover, three of the four top-performing teams incorporated argumentation scheme annotations to enhance their systems. Finally, while most participants employed open-weight models, the two highest-ranking teams relied on proprietary LLMs.

StateCloud at Critical Questions Generation: Prompt Engineering for Critical Question Generation
Jinghui Zhang | Dongming Yang | Binghuai Lin

This paper presents StateCloud’s submission to the Critical Questions Generation (CQs-Gen) shared task at the Argument Mining Workshop 2025. To generate high-quality critical questions from argumentative texts, we propose a framework that combines prompt engineering with few-shot learning to effectively guide generative models. Additionally, we ensemble outputs from diverse large language models (LLMs) to enhance accuracy. Notably, our approach achieved 3rd place in the competition, demonstrating the viability of prompt engineering strategies for argumentative tasks.

Tdnguyen at CQs-Gen 2025: Adapt Large Language Models with Multi-Step Reasoning for Critical Questions Generation
Tien-Dat Nguyen | Duc-Vu Nguyen

This paper explores the generation of Critical Questions (CQs) from argumentative texts using multi-step reasoning techniques, specifically Chain-of-Thoughts (CoT) and Tree-of-Thoughts (ToT) prompting frameworks. CQs are essential for enhancing critical thinking and improving decision-making across various domains. Despite the promise of Large Language Models (LLMs) in this task, generating contextually relevant and logically sound questions remains a challenge. Our experiments show that CoT-based prompting strategies, including Zero-shot and One-shot methods, significantly outperform baseline models in generating high-quality CQs. While ToT prompting offers a more flexible reasoning structure, it was less effective than CoT in this task. We suggest exploring more advanced or computationally intense multi-step reasoning techniques, as well as alternative tree structures for the ToT framework, to further improve CQs-Gen systems.

Webis at CQs-Gen 2025: Prompting and Reranking for Critical Questions
Midhun Kanadan | Johannes Kiesel | Maximilian Heinrich | Benno Stein

This paper reports on the submission of team Webis to the Critical Question Generation shared task at the 12th Workshop on Argument Mining (ArgMining 2025). Our approach is a fully automated two-stage pipeline that first prompts a large language model (LLM) to generate candidate critical questions for a given argumentative intervention, and then reranks the generated questions according to a classifier’s confidence in their usefulness. For the generation stage, we tested zero-shot, few-shot, and chain-of-thought prompting strategies. For the reranking stage, we used a ModernBERT classifier that we fine-tuned on either the validation set or an augmented version of it. Among our submissions, the best-performing configuration achieved a test score of 0.57 and ranked 5th in the shared task. Submissions that use reranking consistently outperformed baseline submissions without reranking across all metrics. Our results demonstrate that combining open-weight LLMs with reranking significantly improves the quality of the resulting critical questions.
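The two-stage structure can be sketched as follows; the generation stub, the classifier path, and the "Useful" label name are placeholders for illustration (the actual system prompts an LLM and uses a fine-tuned ModernBERT classifier), so the sketch is not runnable until a real checkpoint is supplied.

```python
# Rough sketch of a generate-then-rerank pipeline (placeholders only):
# an LLM proposes candidate critical questions and a fine-tuned classifier's
# confidence in the "Useful" label decides which ones to keep.
from transformers import pipeline

# Placeholder path: substitute an actual fine-tuned usefulness classifier.
usefulness = pipeline("text-classification",
                      model="path/to/finetuned-usefulness-classifier")

def generate_candidates(intervention: str, n: int = 10) -> list[str]:
    """Placeholder for the LLM prompting stage (zero-/few-shot or CoT)."""
    raise NotImplementedError

def rerank(intervention: str, candidates: list[str], keep: int = 3) -> list[str]:
    """Keep the candidates the classifier is most confident are useful."""
    scored = []
    for q in candidates:
        pred = usefulness(f"{intervention} [SEP] {q}")[0]
        p_useful = pred["score"] if pred["label"] == "Useful" else 1 - pred["score"]
        scored.append((p_useful, q))
    return [q for _, q in sorted(scored, reverse=True)[:keep]]
```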

DayDreamer at CQs-Gen 2025: Generating Critical Questions through Argument Scheme Completion
Wendi Zhou | Ameer Saadat-Yazdi | Nadin Kökciyan

Critical questions are essential resources to provoke critical thinking when encountering an argumentative text. We present our system for the Critical Questions Generation (CQs-Gen) Shared Task at ArgMining 2025. Our approach leverages large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton’s argumentation schemes. For each input intervention, we conversationally prompt LLMs to instantiate the corresponding argument scheme template to first obtain structured arguments, and then generate relevant critical questions. Following this, we rank all the available critical questions by prompting LLMs to select the top 3 most helpful questions based on the original intervention text. This combination of structured argumentation theory and step-by-step reasoning enables the generation of contextually relevant and diverse critical questions. Our pipeline achieves competitive performance in the final test set, showing its potential to foster critical thinking given argumentative text and detect missing or uninformed claims.

CUET_SR34 at CQs-Gen 2025: Critical Question Generation via Few-Shot LLMs – Integrating NER and Argument Schemes
Sajib Bhattacharjee | Tabassum Basher Rashfi | Samia Rahman | Hasan Murad

Critical Question Generation (CQs-Gen) improves reasoning and critical thinking skills through Critical Questions (CQs), which identify reasoning gaps and address misinformation in NLP, especially as LLM-based chat systems are widely used for learning and may encourage superficial learning habits. The Shared Task on Critical Question Generation, hosted at the 12th Workshop on Argument Mining and co-located with ACL 2025, aims to address these challenges. This study proposes a CQs-Gen pipeline using Llama-3-8B-Instruct-GGUF-Q8_0 with few-shot learning, integrating text simplification, NER, and argument schemes to enhance question quality. Through extensive experiments covering inference without training, fine-tuning with PEFT using LoRA on 10% of the dataset, and few-shot fine-tuning (using five examples) with an 8-bit quantized model, we demonstrate that the few-shot approach outperforms the others. On the validation set, 397 of the 558 generated CQs (71.1%) were classified as Useful. On the test set, 49 of the 102 generated CQs (48%) were classified as Useful, following evaluation through semantic similarity and manual assessment.

ARG2ST at CQs-Gen 2025: Critical Questions Generation through LLMs and Usefulness-based Selection
Alan Ramponi | Gaudenzia Genoni | Sara Tonelli

Critical questions (CQs) generation for argumentative texts is a key task to promote critical thinking and counter misinformation. In this paper, we present a two-step approach for CQs generation that i) uses a large language model (LLM) for generating candidate CQs, and ii) leverages a fine-tuned classifier for ranking and selecting the top-k most useful CQs to present to the user. We show that such usefulness-based CQs selection consistently improves the performance over the standard application of LLMs. Our system was designed in the context of a shared task on CQs generation hosted at the 12th Workshop on Argument Mining, and represents a viable approach to encourage future developments on CQs generation. Our code is made available to the research community.

CriticalBrew at CQs-Gen 2025: Collaborative Multi-Agent Generation and Evaluation of Critical Questions for Arguments
Roxanne El Baff | Dominik Opitz | Diaoulé Diallo

This paper presents the CriticalBrew submission to the CQs-Gen 2025 shared task, which focuses on generating critical questions (CQs) for a given argument. Our approach employs a multi-agent framework containing two sequential components: 1) Generation: machine society simulation for generating CQs, and 2) Evaluation: LLM-based evaluation for selecting the top three questions. The first component models collaboration as a sequence of thinking patterns (e.g., debate → reflect). The second assesses the generated questions using zero-shot prompting, evaluating them against several criteria (e.g., depth). Experiments with different open-weight LLMs (small vs. large) consistently outperformed the baseline, a single LLM with zero-shot prompting. Two configurations, agent count and thinking patterns, significantly impacted the performance in the shared task’s CQ-usefulness evaluation, whereas different LLM-based evaluation strategies (e.g., scoring) had no impact. Our code is available on GitHub.

ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
Lucile Favero | Daniel Frases | Juan Antonio Pérez-Ortiz | Tanja Käser

The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.

Mind_Matrix at CQs-Gen 2025: Adaptive Generation of Critical Questions for Argumentative Interventions
Sha Newaz Mahmud | Shahriar Hossain | Samia Rahman | Momtazul Arefin Labib | Hasan Murad

To encourage computational argumentation through critical question generation (CQs-Gen), we propose an ACL 2025 CQs-Gen shared task system that generates critical questions (CQs) designed to counter argumentative text by uncovering logical fallacies, unjustified assertions, and implicit assumptions. Our system integrates a quantized language model, semantic similarity analysis, and a meta-evaluation feedback mechanism, including key stages such as data preprocessing, rationale-augmented prompting to induce specificity, diversity filtering for redundancy elimination, enriched meta-evaluation for relevance, and a feedback-reflect-refine loop for iterative refinement. Multi-metric scoring guarantees high-quality CQs. With robust error handling, our pipeline ranked 7th among 15 teams, outperforming baseline fact-checking approaches by enabling critical engagement and successfully detecting argumentative fallacies. This study presents an adaptive, scalable method that advances argument mining and critical discourse analysis.

COGNAC at CQs-Gen 2025: Generating Critical Questions with LLM-Assisted Prompting and Multiple RAG Variants
Azwad Anjum Islam | Tisa Islam Erana | Mark A. Finlayson

We describe three approaches to solving the Critical Questions Generation Shared Task at ArgMining 2025. The task objective is to automatically generate critical questions that challenge the strength, validity, and credibility of a given argumentative text. The task dataset comprises debate statements (“interventions”) annotated with a list of named argumentation schemes and associated with a set of critical questions (CQs). Our three Retrieval-Augmented Generation (RAG)-based approaches used in-context example selection based on (1) embedding the intervention, (2) embedding the intervention plus manually curated argumentation scheme descriptions as supplementary context, and (3) embedding the intervention plus a selection of associated CQs and argumentation scheme descriptions. We developed the prompt templates through GPT-4o-assisted analysis of patterns in validation data and the task-specific evaluation guideline. All three of our submitted systems outperformed the official baselines (0.44 and 0.53) with automatically computed accuracies of 0.62, 0.58, and 0.61, respectively, on the test data, with our first method securing the 2nd place in the competition (0.63 manual evaluation). Our results highlight the efficacy of LLM-assisted prompt development and RAG-enhanced generation in crafting contextually relevant critical questions for argument analysis.

TriLLaMa at CQs-Gen 2025: A Two-Stage LLM-Based System for Critical Question Generation
Frieso Turkstra | Sara Nabhani | Khalid Al-Khatib

This paper presents a new system for generating critical questions in debates, developed for the Critical Questions Generation shared task. Our two-stage approach, combining generation and classification, utilizes LLaMA 3.1 Instruct models (8B, 70B, 405B) with zero-/few-shot prompting. Evaluations on annotated debate data reveal several key insights: few-shot generation with 405B yielded relatively high-quality questions, achieving a maximum possible punctuation score of 73.5. The 70B model outperformed both smaller and larger variants on the classification part. The classifiers showed a strong bias toward labeling generated questions as Useful, despite limited validation. Further, our system, ranked 6th, outperformed baselines by 3%. These findings stress the effectiveness of large-sized models for question generation and medium-sized models for classification, and suggest the need for clearer task definitions within prompts to improve classification accuracy.

Overview of MM-ArgFallacy2025 on Multimodal Argumentative Fallacy Detection and Classification in Political Debates
Eleonora Mancini | Federico Ruggeri | Serena Villata | Paolo Torroni

We present an overview of the MM-ArgFallacy2025 shared task on Multimodal Argumentative Fallacy Detection and Classification in Political Debates, co-located with the 12th Workshop on Argument Mining at ACL 2025. The task focuses on identifying and classifying argumentative fallacies across three input modes: text-only, audio-only, and multimodal (text+audio), offering both binary detection (AFD) and multi-class classification (AFC) subtasks. The dataset comprises 18,925 instances for AFD and 3,388 instances for AFC, from the MM-USED-Fallacy corpus on U.S. presidential debates, annotated for six fallacy types: Ad Hominem, Appeal to Authority, Appeal to Emotion, False Cause, Slippery Slope, and Slogan. A total of 5 teams participated: 3 on classification and 2 on detection. Participants employed transformer-based models, particularly RoBERTa variants, with strategies including prompt-guided data augmentation, context integration, specialised loss functions, and various fusion techniques. Audio processing ranged from MFCC features to state-of-the-art speech models. Results demonstrated textual modality dominance, with best text-only performance reaching 0.4856 F1-score for classification and 0.34 for detection. Audio-only approaches underperformed relative to text but showed improvements over previous work, while multimodal fusion showed limited improvements. This task establishes important baselines for multimodal fallacy analysis in political discourse, contributing to computational argumentation and misinformation detection capabilities.

Argumentative Fallacy Detection in Political Debates
Eva Cantín Larumbe | Adriana Chust Vendrell

Building on recent advances in Natural Language Processing (NLP), this work addresses the task of fallacy detection in political debates using a multimodal approach combining text and audio, as well as text-only and audio-only approaches. Although the multimodal setup is novel, results show that text-based models consistently outperform both audio-only and multimodal models, confirming that textual information remains the most effective for this task. Transformer-based and few-shot architectures were used to detect fallacies. While fine-tuned language models demonstrate strong performance, challenges such as data imbalance, audio processing, and limited dataset size persist.

Multimodal Argumentative Fallacy Classification in Political Debates
Warale Avinash Kalyan | Siddharth Pagaria | Chaitra V | Spoorthi H G

Argumentative fallacy classification plays a crucial role in improving discourse quality by identifying flawed reasoning that may mislead or manipulate audiences. While traditional approaches have primarily relied on textual analysis, they often overlook paralinguistic cues such as intonation and prosody that are present in speech. In this study, we explore how multimodal analysis, in which we combine textual and audio features, can enhance fallacy classification in political debates. We develop and evaluate text-only, audio-only, and multimodal models using the MM-USED-fallacy dataset to assess the contribution of each modality. Our findings indicate that the multimodal model, which integrates linguistic and acoustic signals, outperforms unimodal systems, underscoring the potential of multimodal approaches in capturing complex argumentative structures.

Prompt-Guided Augmentation and Multi-modal Fusion for Argumentative Fallacy Classification in Political Debates
Abdullah Tahir | Imaan Ibrar | Huma Ameer | Mehwish Fatima | Seemab Latif

Classifying argumentative fallacies in political discourse is challenging due to their subtle, persuasive nature across text and speech. In our MM-ArgFallacy Shared Task submission, Team NUST investigates uni-modal (text/audio) and multi-modal (text+audio) setups using pretrained models—RoBERTa for text and Whisper for audio. To tackle severe class imbalance, we introduce Prompt-Guided Few-Shot Augmentation (PG-FSA) to generate synthetic samples for underrepresented fallacies. We further propose a late fusion architecture combining linguistic and paralinguistic cues, enhanced with balancing techniques like SMOTE and Focal Loss. Our approach achieves top performance across modalities, ranking 1st in text-only and multi-modal tracks, and 3rd in audio-only, on the official leaderboard. These results underscore the effectiveness of targeted augmentation and modular fusion in multi-modal fallacy classification.
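A schematic sketch of late fusion with focal loss is shown below; the embedding dimensions, the learned modality weight, and the focal-loss gamma are illustrative assumptions rather than the team's exact configuration.

```python
# Schematic sketch: text logits from a head over text embeddings (e.g. RoBERTa
# [CLS]) and audio logits from a head over pooled speech features (e.g. Whisper
# encoder outputs) are combined after encoding; focal loss helps with the
# severe class imbalance among fallacy types. Dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Cross-entropy down-weighted for easy examples (rare classes matter more)."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, reduction="none")
        pt = torch.exp(-ce)                      # probability of the true class
        return ((1 - pt) ** self.gamma * ce).mean()

class LateFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, n_classes=6):
        super().__init__()
        self.text_head = nn.Linear(text_dim, n_classes)
        self.audio_head = nn.Linear(audio_dim, n_classes)
        self.weight = nn.Parameter(torch.tensor(0.5))  # learned modality balance

    def forward(self, text_emb, audio_emb):
        w = torch.sigmoid(self.weight)
        return w * self.text_head(text_emb) + (1 - w) * self.audio_head(audio_emb)

# Usage with pre-extracted embeddings (random tensors stand in for real features):
model, loss_fn = LateFusion(), FocalLoss()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
loss = loss_fn(logits, torch.tensor([0, 2, 5, 1]))
```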

Leveraging Context for Multimodal Fallacy Classification in Political Debates
Alessio Pittiglio

In this paper, we present our submission to the MM-ArgFallacy2025 shared task, which aims to advance research in multimodal argument mining, focusing on logical fallacies in political debates. Our approach uses pretrained Transformer-based models and proposes several ways to leverage context. In the fallacy classification subtask, our models achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal). Our multimodal model showed performance comparable to the text-only model, suggesting potential for improvements.