Ramon Ruiz-Dolz

2025

pdf bib abs
Mining Complex Patterns of Argumentative Reasoning in Natural Language Dialogue
Ramon Ruiz-Dolz | Zlata Kikteva | John Lawrence
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Argumentation scheme mining is the task of automatically identifying reasoning mechanisms behind argument inferences. These mechanisms provide insights into underlying argument structures and guide the assessment of natural language arguments. Research on argumentation scheme mining, however, has always been limited by the scarcity of large enough publicly available corpora containing scheme annotations. In this paper, we present the first state-of-the-art results for mining argumentation schemes in natural language dialogue. For this purpose, we create QT-Schemes, a new corpus of 441 arguments annotated with 24 argumentation schemes. Using this corpus, we leverage the capabilities of LLMs and Transformer-based models, pre-training them on a large corpus containing textbook-like argumentation schemes and validating their applicability in real-world scenarios.

Despite extensive research in Argument Mining (AM), the field faces significant challenges in limited reproducibility, difficulty in comparing systems due to varying task combinations, and a lack of interoperability caused by the heterogeneous nature of argumentation theory. These challenges are further exacerbated by the absence of dedicated tools, with most advancements remaining isolated research outputs rather than reusable systems. The oAMF (Open Argument Mining Framework) addresses these issues by providing an open-source, modular, and scalable platform that unifies diverse AM methods. Initially released with seventeen integrated modules, the oAMF serves as a starting point for researchers and developers to build, experiment with, and deploy AM pipelines while ensuring interoperability and allowing multiple theories of argumentation to co-exist within the same framework. Its flexible design supports integration via Python APIs, drag-and-drop tools, and web interfaces, streamlining AM development for research and industry setup, facilitating method comparison, and reproducibility.

pdf bib
Proceedings of the 12th Argument mining Workshop
Elena Chistova | Philipp Cimiano | Shohreh Haddadan | Gabriella Lapesa | Ramon Ruiz-Dolz
Proceedings of the 12th Argument mining Workshop

pdf bib abs
Practical Solutions to Practical Problems in Developing Argument Mining Systems
Debela Gemechu | Ramon Ruiz-Dolz | John Lawrence | Chris Reed
Proceedings of the 12th Argument mining Workshop

The Open Argument Mining Framework (oAMF) addresses key challenges in argument mining research which still persist despite the field’s impressive growth. Researchers often face difficulties with cross-system comparisons, incompatible representation languages, and limited access to reusable tools. The oAMF introduces a standardised yet flexible architecture that enables seamless component benchmarking, rapid pipeline prototyping using elements from diverse research traditions, and unified evaluation methodologies that preserve theoretical compatibility. By reducing technical overhead, the framework allows researchers to focus on advancing core argument mining capabilities rather than reimplementing infrastructure, fostering greater collaboration at a time when computational reasoning is increasingly vital in the era of large language models.

pdf bib abs
Looking at the Unseen: Effective Sampling of Non-Related Propositions for Argument Mining
Ramon Ruiz-Dolz | Debela Gemechu | Zlata Kikteva | Chris Reed
Proceedings of the 31st International Conference on Computational Linguistics

Traditionally, argument mining research has approached the task of automatic identification of argument structures by using existing definitions of what constitutes an argument, while leaving the equally important matter of what does not qualify as an argument unaddressed. With the ability to distinguish between what is and what is not a natural language argument being at the core of argument mining as a field, it is interesting that no previous work has explored approaches to effectively select non-related propositions (i.e., propositions that are not connected through an argumentative relation, such as support or attack) that improve the data for learning argument mining tasks better. In this paper, we address the question of how to effectively sample non-related propositions from six different argument mining corpora belonging to different domains and encompassing both monologue and dialogue forms of argumentation. To that end, in addition to considering undersampling baselines from previous work, we propose three new sampling strategies relying on context (i.e., short/long) and the semantic similarity between propositions. Our results indicate that using more informed sampling strategies improves the performance, not only when evaluating models on their respective test splits, but also in the case of cross-domain evaluation.

pdf bib abs
Natural Language Reasoning in Large Language Models: Analysis and Evaluation
Debela Gemechu | Ramon Ruiz-Dolz | Henrike Beyer | Chris Reed
Findings of the Association for Computational Linguistics: ACL 2025

While Large Language Models (LLMs) have demonstrated promising results on a range of reasoning benchmarks—particularly in formal logic, mathematical tasks, and Chain-of-Thought prompting—less is known about their capabilities in unconstrained natural language reasoning. Argumentative reasoning, a form of reasoning naturally expressed in language and central to everyday discourse, presents unique challenges for LLMs due to its reliance on context, implicit assumptions, and value judgments. This paper addresses a gap in the study of reasoning in LLMs by presenting the first large-scale evaluation of their unconstrained natural language reasoning capabilities based on natural language argumentation. The paper offers three contributions: (i) the formalisation of a new strategy designed to evaluate argumentative reasoning in LLMs: argument-component selection; (ii) the creation of the Argument Reasoning Tasks (ART) dataset, a new benchmark for argument-component selection based on argument structures for natural language reasoning; and (iii) an extensive experimental analysis involving four different models, demonstrating the limitations of LLMs on natural language reasoning tasks.

pdf bib abs
A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks
Haorui Yu | Ramon Ruiz-Dolz | Qiufeng Yi
Findings of the Association for Computational Linguistics: EMNLP 2025

This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, or Gemini. The experimental design involved persona-guided prompting to assess the VLM’s ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks.

2024

pdf bib abs
ARIES: A General Benchmark for Argument Relation Identification
Debela Gemechu | Ramon Ruiz-Dolz | Chris Reed
Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)

Measuring advances in argument mining is one of the main challenges in the area. Different theories of argument, heterogeneous annotations, and a varied set of argumentation domains make it difficult to contextualise and understand the results reported in different work from a general perspective. In this paper, we present ARIES, a general benchmark for Argument Relation Identification aimed at providing with a standard evaluation for argument mining research. ARIES covers the three different language modelling approaches: sequence and token modelling, and sequence-to-sequence-to-sequence alignment, together with the three main Transformer-based model architectures: encoder-only, decoder-only, and encoder-decoder. Furthermore, the benchmark consists of eight different argument mining datasets, covering the most common argumentation domains, and standardised with the same annotation structures. This paper provides a first comprehensive and comparative set of results in argument mining across a broad range of configurations to compare with, both advancing the state-of-the-art, and establishing a standard way to measure future advances in the area. Across varied task setups and architectures, our experiments reveal consistent challenges in cross-dataset evaluation, with notably poor results. Given the models’ struggle to acquire transferable skills, the task remains challenging, opening avenues for future research.

pdf bib abs
Overview of DialAM-2024: Argument Mining in Natural Language Dialogues
Ramon Ruiz-Dolz | John Lawrence | Ella Schad | Chris Reed
Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)

Argumentation is the process by which humans rationally elaborate their thoughts and opinions in written (e.g., essays) or spoken (e.g., debates) contexts. Argument Mining research, however, has been focused on either written argumentation or spoken argumentation but without considering any additional information, e.g., speech acts and intentions. In this paper, we present an overview of DialAM-2024, the first shared task in dialogical argument mining, where argumentative relations and speech illocutions are modelled together in a unified framework. The task was divided into two different sub-tasks: the identification of propositional relations and the identification of illocutionary relations. Six different teams explored different methodologies to leverage both sources of information to reconstruct argument maps containing the locutions uttered in the speeches and the argumentative propositions implicit in them. The best performing team achieved an F1-score of 67.05% in the overall evaluation of the reconstruction of complete argument maps, considering both sub-tasks included in the DialAM-2024 shared task.

pdf bib abs
Learning Strategies for Robust Argument Mining: An Analysis of Variations in Language and Domain
Ramon Ruiz-Dolz | Chr-Jr Chiu | Chung-Chi Chen | Noriko Kando | Hsin-Hsi Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Argument mining has typically been researched for specific corpora belonging to concrete languages and domains independently in each research work. Human argumentation, however, has domain- and language-dependent linguistic features that determine the content and structure of arguments. Also, when deploying argument mining systems in the wild, we might not be able to control some of these features. Therefore, an important aspect that has not been thoroughly investigated in the argument mining literature is the robustness of such systems to variations in language and domain. In this paper, we present a complete analysis across three different languages and three different domains that allow us to have a better understanding on how to leverage the scarce available corpora to design argument mining systems that are more robust to natural language variations.

2023

pdf bib abs
Detecting Argumentative Fallacies in the Wild: Problems and Limitations of Large Language Models
Ramon Ruiz-Dolz | John Lawrence
Proceedings of the 10th Workshop on Argument Mining

Previous work on the automatic identification of fallacies in natural language text has typically approached the problem in constrained experimental setups that make it difficult to understand the applicability and usefulness of the proposals in the real world. In this paper, we present the first analysis of the limitations that these data-driven approaches could show in real situations. For that purpose, we first create a validation corpus consisting of natural language argumentation schemes. Second, we provide new empirical results to the emerging task of identifying fallacies in natural language text. Third, we analyse the errors observed outside of the testing data domains considering the new validation corpus. Finally, we point out some important limitations observed in our analysis that should be taken into account in future research in this topic. Specifically, if we want to deploy these systems in the Wild.

pdf bib abs
VivesDebate-Speech: A Corpus of Spoken Argumentation to Leverage Audio Features for Argument Mining
Ramon Ruiz-Dolz | Javier Iranzo-Sánchez
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this paper, we describe VivesDebate-Speech, a corpus of spoken argumentation created to leverage audio features for argument mining tasks. The creation of this corpus represents an important contribution to the intersection of speech processing and argument mining communities, and one of the most complete publicly available resources in this topic. Moreover, we have performed a set of first-of-their-kind experiments which show an improvement when integrating audio features into the argument mining pipeline. The provided results can be used as a baseline for future research.

pdf bib abs
Automatic Debate Evaluation with Argumentation Semantics and Natural Language Argument Graph Networks
Ramon Ruiz-Dolz | Stella Heras | Ana Garcia
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The lack of annotated data on professional argumentation and complete argumentative debates has led to the oversimplification and the inability of approaching more complex natural language processing tasks. Such is the case of the automatic evaluation of complete professional argumentative debates. In this paper, we propose an original hybrid method to automatically predict the winning stance in this kind of debates. For that purpose, we combine concepts from argumentation theory such as argumentation frameworks and semantics, with Transformer-based architectures and neural graph networks. Furthermore, we obtain promising results that lay the basis on an unexplored new instance of the automatic analysis of natural language arguments.

Co-authors

Venues