This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we generate only three BibTeX files per volume, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
The computational treatment of arguments on controversial issues has been subject to extensive NLP research, due to its envisioned impact on opinion formation, decision making, writing education, and the like. A critical task in any such application is the assessment of an argument’s quality - but it is also particularly challenging. In this position paper, we start from a brief survey of argument quality research, where we identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment. We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment. Rather than just fine-tuning LLMs towards leaderboard chasing on assessment tasks, they need to be instructed systematically with argumentation theories and scenarios as well as with ways to solve argument-related problems. We discuss the real-world opportunities and ethical issues emerging thereby.
Research on language as interactive discourse underscores the deliberate use of demographic parameters such as gender, ethnicity, and class to shape social identities. For example, by explicitly disclosing one’s information and enforcing one’s social identity to an online community, the reception by and interaction with the said community is impacted, e.g., strengthening one’s opinions by depicting the speaker as credible through their experience in the subject. Here, we present a first thorough study of the role and effects of self-disclosures on online discourse dynamics, focusing on a pervasive type of self-disclosure: author gender. Concretely, we investigate the contexts and properties of gender self-disclosures and their impact on interaction dynamics in an online persuasive forum, ChangeMyView. Our contribution is twofold. At the level of the target phenomenon, we fill a research gap in the understanding of the impact of these self-disclosures on the discourse by bringing together features related to forum activity (votes, number of comments), linguistic/stylistic features from the literature, and discourse topics. At the level of the contributed resource, we enrich and release a comprehensive dataset that will provide a further impulse for research on the interplay between gender disclosures, community interaction, and persuasion in online discourse.
Storytelling, i.e., the use of of anecdotes and personal experiences, plays a crucial role in everyday argumentation. This is particularly true for the highly controversial debates that spark in times of crisis - where the focus of the discussion is on heterogeneous aspects of everyday life. For individuals, stories can have a strong persuasive power; for a larger collective, stories can help decision-makers to develop strategies for addressing the challenges people are facing, especially in times of crisis. In this paper, we analyse the use of storytelling in the COVID-19 discourse. We carry out our analysis on three publicly available Reddit datasets, for a total of 367K comments. We automatically annotate the Reddit datasets by detecting spans containing storytelling and classifying them into: a) personal vs. general – is the story experienced by the speaker? b) argumentative function (Does the story clarify a problem, potentially consisting in harm to a specific group? Does it exemplify a solution to a problem, or does it establish the credibility of the speaker?), and c) topic. We then carry out an analysis which establishes the relevance of storytelling in the COVID discourse and further uncovers interactions between topics and types of stories associated to them.
Computational argumentation is an interdisciplinary research field, connecting Natural Language Processing (NLP) to other disciplines such as the social sciences. The focus of recent research has concentrated on argument quality assessment: what makes an argument good or bad? We present a tutorial which is an updated edition of the EACL 2023 tutorial presented by the same authors. As in the previous version, the tutorial will have a strong interdisciplinary and interactive nature, and will be structured along three main coordinates: (1) the notions of argument quality (AQ) across disciplines (how do we recognize good and bad arguments?), with a particular focus on the interface between Argument Mining (AM) and Deliberation Theory; (2) the modeling of subjectivity (who argues to whom; what are their beliefs?); and (3) the generation of improved arguments (what makes an argument better?). The tutorial will also touch upon a series of topics that are particularly relevant for the LREC-COLING audience (the issue of resource quality for the assessment of AQ; the interdisciplinary application of AM and AQ in a text-as-data approach to Political Science), in line with the developments in NLP (LLMs for AQ assessment), and relevant for the societal applications of AQ assessment (bias and debiasing). We will involve the participants in two annotation studies on the assessment and the improvement of quality.
Effective content moderation is imperative for fostering healthy and productive discussions in online domains. Despite the substantial efforts of moderators, the overwhelming nature of discussion flow can limit their effectiveness. However, it is not only trained moderators who intervene in online discussions to improve their quality. “Ordinary” users also act as moderators, actively intervening to correct information of other users’ posts, enhance arguments, and steer discussions back on course.This paper introduces the phenomenon of user moderation, documenting and releasing UMOD, the first dataset of comments in whichusers act as moderators. UMOD contains 1000 comment-reply pairs from the subreddit r/changemyview with crowdsourced annotations from a large annotator pool and with a fine-grained annotation schema targeting the functions of moderation, stylistic properties(aggressiveness, subjectivity, sentiment), constructiveness, as well as the individual perspectives of the annotators on the task. The releaseof UMOD is complemented by two analyses which focus on the constitutive features of constructiveness in user moderation and on thesources of annotator disagreements, given the high subjectivity of the task.
Computational argumentation is an interdisciplinary research field, connecting Natural Language Processing (NLP) to other disciplines such as the social sciences. This tutorial will focus on a task that recently got into the center of attention in the community: argument quality assessment, that is, what makes an argument good or bad? We structure the tutorial along three main coordinates: (1) the notions of argument quality across disciplines (how do we recognize good and bad arguments?), (2) the modeling of subjectivity (who argues to whom; what are their beliefs?), and (3) the generation of improved arguments (what makes an argument better?). The tutorial highlights interdisciplinary aspects of the field, ranging from the collaboration of theory and practice (e.g., in NLP and social sciences), to approaching different types of linguistic structures (e.g., social media versus parliamentary texts), and facing the ethical issues involved (e.g., how to build applications for the social good). A key feature of this tutorial is its interactive nature: We will involve the participants in two annotation studies on the assessment and the improvement of quality, and we will encourage them to reflect on the challenges and potential of these tasks.
Assessing the quality of an argument is a complex, highly subjective task, influenced by heterogeneous factors (e.g., prior beliefs of the annotators, topic, domain, and application), and crucial for its impact in downstream tasks (e.g., argument retrieval or generation). Both the Argument Mining and the Social Science community have devoted plenty of attention to it, resulting in a wide variety of argument quality dimensions and a large number of annotated resources. This work aims at a better understanding of how the different aspects of argument quality relate to each other from a practical point of view. We employ adapter-fusion (Pfeiffer et al., 2021) as a multi-task learning framework which a) can improve the prediction of individual quality dimensions by injecting knowledge about related dimensions b) is efficient and modular and c) can serve as an analysis tool to investigate relations between different dimensions. We conduct experiments on 6 datasets and 20 quality dimensions. We find that the majority of the dimensions can be learned as a weighted combination of other quality aspects, and that for 8 dimensions adapter fusion improves quality prediction. Last, we show the benefits of this approach by improving the performance in an extrinsic, out-of-domain task: prediction of moderator interventions in a deliberative forum.
Humans are storytellers, even in communication scenarios which are assumed to be more rationality-oriented, such as argumentation. Indeed, supporting arguments with narratives or personal experiences (henceforth, stories) is a very natural thing to do – and yet, this phenomenon is largely unexplored in computational argumentation. Which role do stories play in an argument? Do they make the argument more effective? What are their narrative properties? To address these questions, we collected and annotated StoryARG, a dataset sampled from well-established corpora in computational argumentation (ChangeMyView and RegulationRoom), and the Social Sciences (Europolis), as well as comments to New York Times articles. StoryARG contains 2451 textual spans annotated at two levels. At the argumentative level, we annotate the function of the story (e.g., clarification, disclosure of harm, search for a solution, establishing speaker’s authority), as well as its impact on the effectiveness of the argument and its emotional load. At the level of narrative properties, we annotate whether the story has a plot-like development, is factual or hypothetical, and who the protagonist is. What makes a story effective in an argument? Our analysis of the annotations in StoryARG uncover a positive impact on effectiveness for stories which illustrate a solution to a problem, and in general, annotator-specific preferences that we investigate with regression analysis.
Argument maps structure discourse into nodes in a tree with each node being an argument that supports or opposes its parent argument. This format is more comprehensible and less redundant compared to an unstructured one. Exploring those maps and maintaining their structure by placing new arguments under suitable parents is more challenging for users with huge maps that are typical in online discussions. To support those users, we introduce the task of node placement: suggesting candidate nodes as parents for a new contribution. We establish an upper-bound of human performance, and conduct experiments with models of various sizes and training strategies. We experiment with a selection of maps from Kialo, drawn from a heterogeneous set of domains. Based on an annotation study, we highlight the ambiguity of the task that makes it challenging for both humans and models. We examine the unidirectional relation between tree nodes and show that encoding a node into different embeddings for each of the parent and child cases improves performance. We further show the few-shot effectiveness of our approach.
Reports of personal experiences or stories can play a crucial role in argumentation, as they represent an immediate and (often) relatable way to back up one’s position with respect to a given topic. They are easy to understand and increase empathy: this makes them powerful in argumentation. The impact of personal reports and stories in argumentation has been studied in the Social Sciences, but it is still largely underexplored in NLP. Our work is the first step towards filling this gap: our goal is to develop robust classifiers to identify documents containing personal experiences and reports. The main challenge is the scarcity of annotated data: our solution is to leverage existing annotations to be able to scale-up the analysis. Our contribution is two-fold. First, we conduct a set of in-domain and cross-domain experiments involving three datasets (two from Argument Mining, one from the Social Sciences), modeling architectures, training setups and fine-tuning options tailored to the involved domains. We show that despite the differences among datasets and annotations, robust cross-domain classification is possible. Second, we employ linear regression for performance mining, identifying performance trends both for overall classification performance and individual classifier predictions.
The empirical quantification of the quality of a contribution to a political discussion is at the heart of deliberative theory, the subdiscipline of political science which investigates decision-making in deliberative democracy. Existing annotation on deliberative quality is time-consuming and carried out by experts, typically resulting in small datasets which also suffer from strong class imbalance. Scaling up such annotations with automatic tools is desirable, but very challenging. We take up this challenge and explore different strategies to improve the prediction of deliberative quality dimensions (justification, common good, interactivity, respect) in a standard dataset. Our results show that simple data augmentation techniques successfully alleviate data imbalance. Classifiers based on linguistic features (textual complexity and sentiment/polarity) and classifiers integrating argument quality annotations (from the argument mining community in NLP) were consistently outperformed by transformer-based models, with or without data augmentation.
Agenda-setting is a widely explored phenomenon in political science: powerful stakeholders (governments or their financial supporters) have control over the media and set their agenda: political and economical powers determine which news should be salient. This is a clear case of targeted manipulation to divert the public attention from serious issues affecting internal politics (such as economic downturns and scandals) by flooding the media with potentially distracting information. We investigate agenda-setting in the Russian social media landscape, exploring the relation between economic indicators and mentions of foreign geopolitical entities, as well as of Russia itself. Our contributions are at three levels: at the level of the domain of the investigation, our study is the first to substructure the Russian media landscape in state-controlled vs. independent outlets in the context of strategic distraction from negative economic trends; at the level of the scope of the investigation, we involve a large set of geopolitical entities (while previous work has focused on the U.S.); at the qualitative level, our analysis of posts on Ukraine, whose relationship with Russia is of high geopolitical relevance, provides further insights into the contrast between state-controlled and independent outlets.
The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained, BERT model called QuBERT which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT which is on on par with other state-of-the-art multilingual models for natural language processing achieving between 71 and 74% F1 score on NER and 84–87% on POS tasks.
Many tasks in text-based computational social science (CSS) involve the classification of political statements into categories based on a domain-specific codebook. In order to be useful for CSS analysis, these categories must be fine-grained. The typically skewed distribution of fine-grained categories, however, results in a challenging classification problem on the NLP side. This paper proposes to make use of the hierarchical relations among categories typically present in such codebooks:e.g., markets and taxation are both subcategories of economy, while borders is a subcategory of security. We use these ontological relations as prior knowledge to establish additional constraints on the learned model, thusimproving performance overall and in particular for infrequent categories. We evaluate several lightweight variants of this intuition by extending state-of-the-art transformer-based textclassifiers on two datasets and multiple languages. We find the most consistent improvement for an approach based on regularization.
Translate-train or few-shot cross-lingual transfer can be used to improve the zero-shot performance of multilingual pretrained language models. Few-shot utilizes high-quality low-quantity samples (often manually translated from the English corpus ). Translate-train employs a machine translation of the English corpus, resulting in samples with lower quality that could be scaled to high quantity. Given the lower cost and higher availability of machine translation compared to manual professional translation, it is important to systematically compare few-shot and translate-train, understand when each has an advantage, and investigate how to choose the shots to translate in order to increase the few-shot gain. This work aims to fill this gap: we compare and quantify the performance gain of few-shot vs. translate-train using three different base models and a varying number of samples for three tasks/datasets (XNLI, PAWS-X, XQuAD) spanning 17 languages. We show that scaling up the training data using machine translation gives a larger gain compared to using the small-scale (higher-quality) few-shot data. When few-shot is beneficial, we show that there are random sets of samples that perform better across languages and that the performance on English and on the machine-translation of the samples can both be used to choose the shots to manually translate for an increased few-shot gain.
What is the first word that comes to your mind when you hear giraffe, or damsel, or freedom? Such free associations contain a huge amount of information on the mental representations of the corresponding concepts, and are thus an extremely valuable testbed for the evaluation of semantic representations extracted from corpora. In this paper, we present FAST (Free ASsociation Tasks), a free association dataset for English rigorously sampled from two standard free association norms collections (the Edinburgh Associative Thesaurus and the University of South Florida Free Association Norms), discuss two evaluation tasks, and provide baseline results. In parallel, we discuss methodological considerations concerning the desiderata for a proper evaluation of semantic representations.
The analysis of public debates crucially requires the classification of political demands according to hierarchical claim ontologies (e.g. for immigration, a supercategory “Controlling Migration” might have subcategories “Asylum limit” or “Border installations”). A major challenge for automatic claim classification is the large number and low frequency of such subclasses. We address it by jointly predicting pairs of matching super- and subcategories. We operationalize this idea by (a) encoding soft constraints in the claim classifier and (b) imposing hard constraints via Integer Linear Programming. Our experiments with different claim classifiers on a German immigration newspaper corpus show consistent performance increases for joint prediction, in particular for infrequent categories and discuss the complementarity of the two approaches.
Kiezdeutsch is a variety of German predominantly spoken by teenagers from multi-ethnic urban neighborhoods in casual conversations with their peers. In recent years, the popularity of Kiezdeutsch has increased among young people, independently of their socio-economic origin, and has spread in social media, too. While previous studies have extensively investigated this language variety from a linguistic and qualitative perspective, not much has been done from a quantitative point of view. We perform the first large-scale data-driven analysis of the lexical and morpho-syntactic properties of Kiezdeutsch in comparison with standard German. At the level of results, we confirm predictions of previous qualitative analyses and integrate them with further observations on specific linguistic phenomena such as slang and self-centered speaker attitude. At the methodological level, we provide logistic regression as a framework to perform bottom-up feature selection in order to quantify differences across language varieties.
Human moderation is commonly employed in deliberative contexts (argumentation and discussion targeting a shared decision on an issue relevant to a group, e.g., citizens arguing on how to employ a shared budget). As the scale of discussion enlarges in online settings, the overall discussion quality risks to drop and moderation becomes more important to assist participants in having a cooperative and productive interaction. The scale also makes it more important to employ NLP methods for(semi-)automatic moderation, e.g. to prioritize when moderation is most needed. In this work, we make the first steps towards (semi-)automatic moderation by using state-of-the-art classification models to predict which posts require moderation, showing that while the task is undoubtedly difficult, performance is significantly above baseline. We further investigate whether argument quality is a key indicator of the need for moderation, showing that surprisingly, high quality arguments also trigger moderation. We make our code and data publicly available.
This survey builds an interdisciplinary picture of Argument Mining (AM), with a strong focus on its potential to address issues related to Social and Political Science. More specifically, we focus on AM challenges related to its applications to social media and in the multilingual domain, and then proceed to the widely debated notion of argument quality. We propose a novel definition of argument quality which is integrated with that of deliberative quality from the Social Science literature. Under our definition, the quality of a contribution needs to be assessed at multiple levels: the contribution itself, its preceding context, and the consequential effect on the development of the upcoming discourse. The latter has not received the deserved attention within the community. We finally define an application of AM for Social Good: (semi-)automatic moderation, a highly integrative application which (a) represents a challenging testbed for the integrated notion of quality we advocate, (b) allows the empirical quantification of argument/deliberative quality to benefit from the developments in other NLP fields (i.e. hate speech detection, fact checking, debiasing), and (c) has a clearly beneficial potential at the level of its societal thanks to its real-world application (even if extremely ambitious).
DEbateNet-migr15 is a manually annotated dataset for German which covers the public debate on immigration in 2015. The building block of our annotation is the political science notion of a claim, i.e., a statement made by a political actor (a politician, a party, or a group of citizens) that a specific action should be taken (e.g., vacant flats should be assigned to refugees). We identify claims in newspaper articles, assign them to actors and fine-grained categories and annotate their polarity and date. The aim of this paper is two-fold: first, we release the full DEbateNet-mig15 corpus and document it by means of a quantitative and qualitative analysis; second, we demonstrate its application in a discourse network analysis framework, which enables us to capture the temporal dynamics of the political debate
Manifestos are official documents of political parties, providing a comprehensive topical overview of the electoral programs. Voters, however, seldom read them and often prefer other channels, such as newspaper articles, to understand the party positions on various policy issues. The natural question to ask is how compatible these two formats (manifesto and newspaper reports) are in their representation of party positioning. We address this question with an approach that combines political science (manual annotation and analysis) and natural language processing (supervised claim identification) in a cross-text type setting: we train a classifier on annotated newspaper data and test its performance on manifestos. Our findings show a) strong performance for supervised classification even across text types and b) a substantive overlap between the two formats in terms of party positioning, with differences regarding the salience of specific issues.
This paper describes the MARDY corpus annotation environment developed for a collaboration between political science and computational linguistics. The tool realizes the complete workflow necessary for annotating a large newspaper text collection with rich information about claims (demands) raised by politicians and other actors, including claim and actor spans, relations, and polarities. In addition to the annotation GUI, the tool supports the identification of relevant documents, text pre-processing, user management, integration of external knowledge bases, annotation comparison and merging, statistical analysis, and the incorporation of machine learning models as “pseudo-annotators”.
This paper presents a large-scale evaluation study of dependency-based distributional semantic models. We evaluate dependency-filtered and dependency-structured DSMs in a number of standard semantic similarity tasks, systematically exploring their parameter space in order to give them a “fair shot” against window-based models. Our results show that properly tuned window-based DSMs still outperform the dependency-based models in most tasks. There appears to be little need for the language-dependent resources and computational cost associated with syntactic analysis.
This paper presents the results of a large-scale evaluation study of window-based Distributional Semantic Models on a wide variety of tasks. Our study combines a broad coverage of model parameters with a model selection methodology that is robust to overfitting and able to capture parameter interactions. We show that our strategy allows us to identify parameter configurations that achieve good performance across different datasets and tasks.
The aim of this paper is to introduce LexIt, a computational framework for the automatic acquisition and exploration of distributional information about Italian verbs, nouns and adjectives, freely available through a web interface at the address http://sesia.humnet.unipi.it/lexit. LexIt is the first large-scale resource for Italian in which subcategorization and semantic selection properties are characterized fully on distributional ground: in the paper we describe both the process of data extraction and the evaluation of the subcategorization frames extracted with LexIt.
n this paper, we outline the methodology we adopted to develop a FrameNet for Italian. The main element of novelty with respect to the original FrameNet is represented by the fact that the creation and annotation of Lexical Units is strictly grounded in distributional information (statistical distribution of verbal subcategorization frames, lexical and semantic preferences of each frame) automatically acquired from a large, dependency-parsed corpus. We claim that this approach allows us to overcome some of the shortcomings of the classical lexicographic method used to create FrameNet, by complementing the accuracy of manual annotation with the robustness of data on the global distributional patterns of a verb. In the paper, we describe our method for extracting distributional data from the corpus and the way we used it for the encoding and annotation of LUs. The long-term goal of our project is to create an electronic lexicon for Italian similar to the original English FrameNet. For the moment, we have developed a database of syntactic valences that will be made freely accessible via a web interface. This represents an autonomous resource besides the FrameNet lexicon, of which we have a beginning nucleus consisting of 791 annotated sentences.