Proceedings of the 13th Workshop on Argument Mining and Reasoning
Mohamed Elaraby, Annette Hautli-Janisz, Julia Romberg, Elena Musi, Federico Ruggeri, John Lawrence (Editors)
- Anthology ID: 2026.argmining-1
- Month: July
- Year: 2026
- Address: San Diego, California, USA
- Venues: ArgMining | WS
- Events: Annual Meeting of the Association for Computational Linguistics (2026) | Workshop on Argument Mining (2026) | Other Workshops and Events (2026)
- SIG:
- Publisher: Association for Computational Linguistics
- URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.argmining-1/
- DOI:
- ISBN: 979-8-89176-399-9
- PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.argmining-1.pdf
Proceedings of the 13th Workshop on Argument Mining and Reasoning
Mohamed Elaraby | Annette Hautli-Janisz | Julia Romberg | Elena Musi | Federico Ruggeri | John Lawrence
STCOR: A Trilevel Syllogism-Driven Reasoning Framework
Keying Yang | Hao Wang | Chengtao Jian | Kai Yang
Inspired by the human expert thinking paradigm in operations research, this work introduces a new concept of reasoning tasks: Textual Constrained Optimization (TCO) problems. A TCO problem is characterized by a natural language description that implicitly specifies an underlying structured model with variables, constraints, and objectives. We propose a novel Syllogism-driven Textual Constrained Optimization Reasoning (STCOR) paradigm, driven by classical syllogistic logic. Unlike contemporary stepwise methods, our framework structures reasoning into three phases: meta-modeling, which acts as the major premise by retrieving a relevant class-driven prototype template; formalization, which serves as the minor premise by instantiating the template into an explicit logical model from textual queries; and solving, which derives the final answer as the conclusion. To support the end-to-end implementation, we further develop a tri-level optimization algorithm, TriRL.
Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification
Eleni Papadopulos | Firoj Alam | Giovanni Da San Martino
In today’s fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, fallacies often appear in nuanced forms that complicate automated classification. In this study, we investigate whether merging abstract logical structures with context-level linguistic cues proves beneficial for fallacy classification, developing a framework that inductively extracts such patterns from fallacious examples and their explanations using Large Language Models (LLMs). We evaluate the impact of these patterns across different LLMs and experimental zero- and one-shot configurations, showing statistically significant improvements over zero-shot baselines and outperforming competing approaches. Cross-dataset experiments validate generalization, establishing data-driven pattern extraction as an effective method for generating logical representations.
A Three-Level Audit of LLM Alignment for Argument Quality Assessment
Wei-Fan Chen | Jinming Yu | Lucie Flek
Large Language Models (LLMs) are increasingly used as automated evaluators of argument quality. However, existing studies typically assess models only through their agreement with human scores, leaving the reasoning process behind these judgments unexplored. In this paper, we propose a three-level audit framework for evaluating the reliability of LLM-based argument quality assessment. The framework distinguishes between (1) surface alignment, measuring agreement between LLM-predicted scores and human annotations; (2) instructional alignment, assessing whether generated rationales adhere to the intended evaluation criteria; and (3) faithfulness alignment, examining whether predicted scores are supported by the generated rationales. To operationalize this audit, we introduce structural rationale prompting, which guides LLMs to generate structured justifications before assigning scores across 11 dimensions of the Dagstuhl-15512 argument quality corpus. We evaluate several LLMs under this framework and find that structural rationale prompting substantially improves agreement with human annotations compared to definition-based prompting. Furthermore, the generated rationales generally follow the evaluation instructions and remain highly consistent with the predicted scores. Overall, our results suggest that auditing LLM evaluators beyond surface score agreement provides deeper insight into the reliability and transparency of LLM-based evaluation.
Stance classification is a core task in argument mining and subjectivity analysis, crucial for understanding public discourse and opinion dynamics on social media. Despite their impressive few-shot capabilities, Large Language Models (LLMs) remain sensitive to prompt construction, including the selection and ordering of in-context examples. In this paper, we propose a Topic-Guided prompting method for argument stance classification that dynamically integrates topic-specific information into the few-shot context. We evaluate our method on five LLMs across three datasets spanning formal debates and user-generated online comments. Our extensive evaluation shows that our proposed Topic-Guided prompting outperforms standard few-shot prompting and state-of-the-art example selection strategies. Further analysis indicates that our method reduces the bias towards the ’support’ class observed in several models, resulting in more balanced predictions across stances and thus a more robust approach to stance classification.
AMResources: Cataloging Argument Mining Datasets
Dexter Williams | Shiwei Liu | Manfred Stede | Henning Wachsmuth | Jodi Schneider
Annotated datasets are essential for developing and evaluating argument mining systems, yet information about argument mining datasets remains scattered across papers, repositories, and task-specific surveys. To address this, we introduce AMResources (http://purl.archive.org/amresources), an online catalog that organizes argument mining datasets by task, and captures relationships among datasets, releases, and papers. We draw particular attention to relationships such as re-annotation and dataset extension. To curate dataset information into a consistent and provenance-aware structure, AMResources links datasets to canonical papers. For each dataset release, AMResources records standardized metadata such as language, genre, unit type and unit count, annotator characteristics, agreement reporting, and accessibility. We argue that such structured dataset documentation remains critical in the era of large language models, where annotated datasets increasingly serve as high-quality evaluation benchmarks and where tracing dataset provenance and annotation layers is necessary for systematic comparisons across tasks.
Argument-Based Comparative Question Answering Evaluation Benchmark
Irina Nikishina | Saba Anwar | Nikolay Dolgov | Maria Manina | Daria Ignatenko | Viktor Moskvoretskii | Artem Shelmanov | Tim Baldwin | Chris Biemann
Despite the ability of large language models (LLMs) to generate coherent comparative answers, automatic comparative question answering (CQA) remains challenging due to the absence of standardized evaluation criteria and the high resource demands of manual assessment. To address these problems, this paper proposes a comprehensive evaluation framework designed to assess the quality of CQA summaries using LLMs-as-a-Judge. We formulate 15 evaluation criteria for assessing comparative answers generated by various sources, including LLMs, human experts, and prior work. To capture a diverse range of comparative answers, LLM summaries were generated under various prompting scenarios. We evaluate the effectiveness of our framework using both human assessment and LLMs, demonstrating the consistency between automated and manual evaluations. Finally, we fine-tune Llama-3-8B-Instruct on a dataset generated from the best-performing CQA models in our evaluation.
Illustrating Arguments with Images Using Aspect-Aware Prompting
Maximilian Heinrich | Sharat Anand | Johannes Kiesel | Benno Stein
Images can powerfully strengthen arguments, conveying ideas more immediately and compellingly than text alone. With the rise of text-to-image models, a broad audience can now generate custom visuals to illustrate their arguments. Yet a fundamental mismatch undermines this potential: these models are trained on concrete scene descriptions, while arguments operate at the level of general, abstract principles. Naively prompting such a model with an argumentative text therefore rarely produces images that genuinely illustrate the argument. To address this challenge, we propose an aspect-aware image generation approach. Given an argument, our method first identifies the key aspects that an illustrative image should convey, then constructs a detailed scene description grounded in both the argument and those aspects, and finally generates an image using that scene description as the prompt. A human-assessment evaluation demonstrates that this approach yields images that illustrate arguments significantly better than those produced by naive prompting.
Do We Need Large Models for Argument Classification? Revisiting the Role of Model Compression
Filip Gampel | Rafał Olszowski | Marcin Pietroń
Large language models have improved argument mining substantially, but the associated computational cost complicates deployment, replication, and systematic comparison. We examine how much compression an open-source large language model can tolerate before argument classification quality degrades. Using gpt-oss-20b as the base model, we study pruning with Wanda and post-training quantization under a zero-shot prompting setup. We evaluate compressed variants on three argument-mining resources, namely UKP, Args.me, and ARIES, and contrast their behavior with general language-model benchmarks. The results show a consistent pattern: moderate pruning preserves most of the original performance on argument classification, whereas activation quantization causes larger and more systematic drops. The findings suggest that argument classification is more compression-tolerant than general-purpose evaluation suites, but only up to a point, and they should not be interpreted as evidence that aggressive compression is universally safe. We therefore position compression as a practical way to reduce model cost for argument analysis, while emphasizing that claims about efficiency gains must distinguish between preserved predictive quality and realized runtime speedups.
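As a rough illustration of the pruning criterion named in this abstract, the sketch below scores each weight by its magnitude times the L2 norm of the matching input activation and zeroes the lowest-scoring weights in each output row, which is the Wanda criterion as commonly described. The function name `wanda_prune`, the toy layer, and the single calibration vector are assumptions for illustration; this is not the authors' setup for gpt-oss-20b.

```python
import math

def wanda_prune(weights, activations, sparsity=0.5):
    """Zero the lowest-scoring weights in every output row.

    weights:     list of rows, one row per output neuron
    activations: list of calibration input vectors (hypothetical data)
    score(i, j): |W[i][j]| * L2 norm of input feature j over the calibration set
    """
    n_in = len(weights[0])
    # Per-input-feature L2 norm across all calibration vectors.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    k = int(n_in * sparsity)  # number of weights removed per row
    pruned = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Indices of the k smallest scores within this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy layer: the largest weight (3.0) sees a near-zero activation and is pruned.
layer = [[1.0, -0.5, 0.2, 3.0]]
calibration = [[0.1, 4.0, 1.0, 0.01]]
result = wanda_prune(layer, calibration, sparsity=0.5)
```

In this toy case the score depends on the activation as well as the weight, so a large weight on a rarely active input is removed before a small weight on a strongly active one.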
A Neural Approach to Fine-Grained Argumentation Strategy Classification with Emotion and Moral Value Lexicons across Multiple Domains
Mohammad Yeghaneh Abkenar | Weixing Wang | Manfred Stede | Julia Romberg
Fine-grained argumentation mining goes beyond coarse-grained distinctions such as claim and premise, by delving deeper into the underlying strategies employed (e.g., the use of facts or values to persuade the audience). Despite the advancements brought about by pre-trained language models, the task remains challenging. We investigate whether auxiliary knowledge such as emotion and moral value lexicon features can improve the classification of fine-grained argumentation strategies. Our Neural Flair Transformer Classifier (NFTC), in its base form, fine-tunes a transformer-based document encoder (RoBERTa) for end-to-end argument component classification. Evaluated across four corpora from diverse domains spanning public participation, persuasive forums, product reviews, and student essays, NFTC consistently outperforms majority-voting and Qwen2.5-7B baselines, achieving competitive performance on all datasets. Moreover, gains are observed against a fine-tuned LLaMA-3-8B-Instruct model, regarded in prior work as a leading approach. Injecting additional knowledge into NFTC yields mixed effects: emotion and moral value features provide consistent gains in product reviews and persuasive forums, but not in the other two domains. Our findings suggest that the utility of subjective knowledge is domain and schema dependent.
Overview of the UZH Shared Task 2026 on Reconstructing the Reasoning in United Nations Resolutions
Anastassia Shaitarova | Yingqiang Gao | Fatma-Zohra Rezkellah | Reto Gubelmann | Patrick Montjouridès
This paper presents the UZH Shared Task at the 13th Workshop on Argument Mining and Reasoning, co-located with ACL 2026, which focuses on reconstructing argumentative structure in highly formal legal-political texts, namely United Nations resolutions and recommendations. The shared task addresses the challenge of recovering paragraph-level reasoning patterns from the fairly formulaic structure of international decision-making records. It comprises two subtasks: (1) paragraph classification, where systems identify paragraph type (preambular or operative) and assign one or more thematic tags, and (2) argumentative relation prediction, where systems infer links between paragraphs and label them with relation types.
LLM-INSTRUCT at UZH Shared Task 2026: Constraint-Aware Retrieval and Selective Debate for Paragraph-Level Argument Mining
Phuong Huu Vu Tran | Long Minh Vo | Son Nguyen Minh Le | Hoang Van
We present LLM-INSTRUCT, the winning system for the UZH Shared Task at ArgMining 2026 on paragraph-level argument mining in UN and UNESCO resolutions. The task requires paragraph-type classification, prediction of a subset of 141 official tags, and directed relation prediction under a strict JSON schema setting using only open-weight models up to 8B parameters. We frame the task as constrained structured prediction. The system first narrows the candidate tag space with metadata-aware dense retrieval, then applies constrained decoding with per-dimension caps, and escalates only uncertain cases to a three-agent debate branch.
RESOLVENOW at UZH Shared Task 2026: Rule-Based Type Classification with LLM-Driven Multi-Label Tagging for UN Resolutions
Vedant Gupta | Rahul Bhatia | Vaibhav Varshney | Manjunatha Naik
Subtask 1 of the UZH Shared Task 2026 asks for paragraph-level classification of UN resolutions as preambular or operative and multi-label tagging from a 141-code, 15-dimension taxonomy, scored by tag F1 and an open-weight LLM-as-Judge on reasoning quality. Two earlier pipelines we built failed in opposite ways. An embedding-retrieval system dropped relevant tags before the LLM saw them; a per-dimension prompting system was accurate but too slow to iterate. The submitted system fixes both. A deterministic French-English lexical classifier assigns paragraph types at type macro-F1 of 0.910 on the official silver standard with no LLM calls, and DeepSeek-R1-0528-Qwen3-8B predicts tags through a single merged prompt that exposes the full taxonomy.
Argchestrators at UZH Shared Task 2026: Efficient Argument Mining in UN Resolutions: A Sub-8B Pipeline using Agentic Debate and Heuristic Retrieval
Bogdan Octavian Grecu | Gerrit Quaremba | Elizabeth Black | Denny Vrandečić | Elena Simperl | Oana Cocarascu
The highly formal and negotiated language of United Nations (UN) resolutions presents unique challenges for argument mining. This paper describes our system submitted to the ArgMining 2026 Shared Task: Reconstructing the Reasoning in United Nations Resolutions. Adhering to the strict constraint of utilising open-weight models with at most 8 billion parameters, we propose a hybrid, compute-efficient architecture powered by Qwen3-8B. For the preambular-operative classification, we implement a set of deterministic rules related to the specificity of UN documents, supplemented by an LLM-based multi-label classifier for thematic dimensions and a directed-graph extraction approach for argumentative relation prediction.
Prompteam at UZH Shared Task 2026: RAG-Augmented Classification and Cosine-Filtered Relation Prediction for UN Resolutions
Siddhartha Khandelwal | Jyotsana Bhardwaj
We describe our system for the UZH ArgMining 2026 Shared Task on reconstructing argumentative structure in UN/UNESCO resolutions. The task requires (1) classifying paragraph types and assigning thematic tags from a 141-label taxonomy, and (2) predicting directed argumentative relations between paragraphs. Our pipeline combines a quantised Qwen2.5-7B-Instruct model with retrieval-augmented generation (RAG) backed by FAISS-indexed dense embeddings for few-shot prompting and tag candidate pre-filtering. For relation prediction, we apply a sliding-window cosine pre-filter that reduces the quadratic pair space to near-linear cost. A parallelisable, fault-tolerant pipeline with atomic checkpointing enabled complete processing of 2,959 paragraphs across three concurrent Kaggle T4 sessions despite 12-hour GPU limits. Our system achieved 2nd place overall on the shared task leaderboard.
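The sliding-window cosine pre-filter mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size, the threshold, and the toy embeddings are all assumptions. Each paragraph is compared only with the next `window` paragraphs, so the number of comparisons grows as n × window rather than n².

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def candidate_pairs(embeddings, window=5, threshold=0.5):
    """Return index pairs (i, j), i < j <= i + window, whose similarity passes the threshold.

    Only pairs inside the window are scored; deciding the direction and type of
    each relation is left to a downstream model (an assumption of this sketch).
    """
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical 2-d paragraph embeddings, for illustration only.
toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.1]]
narrow = candidate_pairs(toy, window=1, threshold=0.9)
wide = candidate_pairs(toy, window=3, threshold=0.9)
```

A wider window recovers longer-range candidate pairs at a proportionally higher cost, so the window size trades recall against the number of pairs passed to the relation classifier.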
TypeCoT at UZH Shared Task 2026: Reconstructing Argumentative Structure in UN Resolutions using Type-Informed Chain-of-Thought
Chandan Kumar R S | Vinay Babu Ulli | Jyoti Kumari | Vaibhav Singh
United Nations and UNESCO resolutions encode complex collective reasoning through highly structured preambles and operative clauses. Reconstructing this implicit argumentative structure is a challenging natural language processing task. This paper describes our submission to the UZH Shared Task at the ArgMining Workshop 2026. Adhering to the strict constraint of using open-weight models with at most 8B parameters, we propose a highly efficient, modular pipeline built entirely upon the Qwen-2.5-7B-Instruct architecture. To address Subtask 1, we decouple the problem, employing a 4-bit quantized LoRA adapter via the Unsloth framework for paragraph type classification and a type-informed chain-of-thought approach for thematic tagging and relation prediction.
POINTERS at UZH Shared Task 2026: Reasoning Probes for Argumentation Mining in UN Resolutions
Sohom Sen | Avina Nakarmi | Xun Song | Aritra Dasgupta
This paper describes the submission of team POINTERS to the UZH ArgMining 2026 Shared Task, which aims to recover the argumentation structure of UN and UNESCO resolutions by labeling paragraph types, assigning specific tags, and predicting relations between paragraphs. We take a generative approach, treating each resolution as a sequence of claim-evidence pairs connected by explicit reasoning strategies. First, each paragraph is classified as preambular or operative and assigned tags, with the model required to quote specific phrases to justify every decision. Second, for each paragraph, we first retrieve semantically related candidates using sentence transformers, then use reasoning strategies as a diagnostic scaffold to label the relation—supporting, complemental, contradictive, or modifying—along with a quoted, strategy-grounded rationale.
HybridArguer at UZH Shared Task 2026: Argument Structure Modeling in Bilingual UN Resolutions with Retrieval-Augmented and Iterative LLM Reasoning
Siddharth Bhargava
Extracting argument structures from legal-political discourse reveals how policies and actions are proposed, debated, and formalized, but remains challenging due to the complexity of long-form, structured text. This work proposes a modular, retrieval-augmented system for traceable and structured argument mining in long, bilingual United Nations resolutions. This paper describes our system submission to the UZH Shared Task 2026, focusing on practical design choices for argument structure modeling under task and model constraints. Our system employs a parameter-efficient (at most 8B) open-source model, Qwen3:8B in thinking mode, to perform paragraph classification, multi-label tag assignment, and multi-label relation prediction through a modular, retrieval-augmented pipeline.