Iman Jundi
Human moderators in online discussions face a heterogeneous range of tasks, which go beyond content moderation, or policing. They also support and improve discussion quality, which is challenging to model (and evaluate) in NLP due to its inherent subjectivity and the scarcity of annotated resources. We address this gap by introducing PerspectiveMod, a dataset of online comments annotated for the question: *“Does this comment require moderation, and why?”* Annotations were collected from both expert moderators and trained non-experts. **PerspectiveMod** is unique in its intentional variation across (a) the level of moderation experience embedded in the source data (professional vs. non-professional moderation environments), (b) the annotator profiles (experts vs. trained crowdworkers), and (c) the richness of each moderation judgment, both in terms of fine-grained comment properties (drawn from argumentation and deliberative theory) and in the representation of the individuality of the annotator (socio-demographics and attitudes towards the task). We advance understanding of the task’s complexity by providing interpretation layers that account for its subjectivity. Our statistical analysis highlights the value of collecting annotator perspectives, including their experiences, attitudes, and views on AI, as a foundation for developing more context-aware and interpretively robust moderation tools.
Moderation is essential for maintaining and improving the quality of online discussions. This involves: (1) countering negativity, e.g. hate speech and toxicity, and (2) promoting positive discourse, e.g. broadening the discussion to involve other users and perspectives. While significant efforts have focused on addressing negativity, driven by the urgency of such issues, this has left moderation that promotes positive discourse (henceforth Positive Moderation) under-studied. With the recent advancements in LLMs, Positive Moderation can potentially be scaled to vast conversations, fostering more thoughtful discussions and bridging the increasing divide in online interactions. We advance the understanding of Positive Moderation by annotating a dataset on 13 moderation properties, e.g. neutrality, clarity and curiosity. We extract instructions from professional moderation guidelines and use them to prompt LLaMA to generate such moderation. This is followed by extensive evaluation showing that (1) annotators rate generated moderation higher than professional moderation, but still slightly prefer professional moderation in pairwise comparison, and (2) LLMs can be used to estimate human evaluation as an efficient alternative.
Effective content moderation is imperative for fostering healthy and productive discussions in online domains. Despite the substantial efforts of moderators, the overwhelming nature of discussion flow can limit their effectiveness. However, it is not only trained moderators who intervene in online discussions to improve their quality. “Ordinary” users also act as moderators, actively intervening to correct information in other users’ posts, enhance arguments, and steer discussions back on course. This paper introduces the phenomenon of user moderation, documenting and releasing UMOD, the first dataset of comments in which users act as moderators. UMOD contains 1000 comment-reply pairs from the subreddit r/changemyview with crowdsourced annotations from a large annotator pool and with a fine-grained annotation schema targeting the functions of moderation, stylistic properties (aggressiveness, subjectivity, sentiment), constructiveness, as well as the individual perspectives of the annotators on the task. The release of UMOD is complemented by two analyses which focus on the constitutive features of constructiveness in user moderation and on the sources of annotator disagreements, given the high subjectivity of the task.
Argument maps structure discourse into nodes in a tree with each node being an argument that supports or opposes its parent argument. This format is more comprehensible and less redundant compared to an unstructured one. Exploring those maps and maintaining their structure by placing new arguments under suitable parents is more challenging for users with huge maps that are typical in online discussions. To support those users, we introduce the task of node placement: suggesting candidate nodes as parents for a new contribution. We establish an upper-bound of human performance, and conduct experiments with models of various sizes and training strategies. We experiment with a selection of maps from Kialo, drawn from a heterogeneous set of domains. Based on an annotation study, we highlight the ambiguity of the task that makes it challenging for both humans and models. We examine the unidirectional relation between tree nodes and show that encoding a node into different embeddings for each of the parent and child cases improves performance. We further show the few-shot effectiveness of our approach.
The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained BERT model called QuBERT, which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT, which is on par with other state-of-the-art multilingual models for natural language processing, achieving between 71 and 74% F1 score on NER and 84–87% on POS tasks.
Translate-train or few-shot cross-lingual transfer can be used to improve the zero-shot performance of multilingual pretrained language models. Few-shot utilizes high-quality low-quantity samples (often manually translated from the English corpus). Translate-train employs a machine translation of the English corpus, resulting in samples with lower quality that could be scaled to high quantity. Given the lower cost and higher availability of machine translation compared to manual professional translation, it is important to systematically compare few-shot and translate-train, understand when each has an advantage, and investigate how to choose the shots to translate in order to increase the few-shot gain. This work aims to fill this gap: we compare and quantify the performance gain of few-shot vs. translate-train using three different base models and a varying number of samples for three tasks/datasets (XNLI, PAWS-X, XQuAD) spanning 17 languages. We show that scaling up the training data using machine translation gives a larger gain compared to using the small-scale (higher-quality) few-shot data. When few-shot is beneficial, we show that there are random sets of samples that perform better across languages and that the performance on English and on the machine-translation of the samples can both be used to choose the shots to manually translate for an increased few-shot gain.
This survey builds an interdisciplinary picture of Argument Mining (AM), with a strong focus on its potential to address issues related to Social and Political Science. More specifically, we focus on AM challenges related to its applications to social media and in the multilingual domain, and then proceed to the widely debated notion of argument quality. We propose a novel definition of argument quality which is integrated with that of deliberative quality from the Social Science literature. Under our definition, the quality of a contribution needs to be assessed at multiple levels: the contribution itself, its preceding context, and the consequential effect on the development of the upcoming discourse. The latter has not received the deserved attention within the community. We finally define an application of AM for Social Good: (semi-)automatic moderation, a highly integrative application which (a) represents a challenging testbed for the integrated notion of quality we advocate, (b) allows the empirical quantification of argument/deliberative quality to benefit from the developments in other NLP fields (e.g. hate speech detection, fact checking, debiasing), and (c) has clearly beneficial potential at the societal level thanks to its real-world application (even if extremely ambitious).
Human moderation is commonly employed in deliberative contexts (argumentation and discussion targeting a shared decision on an issue relevant to a group, e.g., citizens arguing on how to employ a shared budget). As the scale of discussion enlarges in online settings, the overall discussion quality risks dropping and moderation becomes more important to assist participants in having a cooperative and productive interaction. The scale also makes it more important to employ NLP methods for (semi-)automatic moderation, e.g. to prioritize when moderation is most needed. In this work, we make the first steps towards (semi-)automatic moderation by using state-of-the-art classification models to predict which posts require moderation, showing that while the task is undoubtedly difficult, performance is significantly above baseline. We further investigate whether argument quality is a key indicator of the need for moderation, showing that surprisingly, high quality arguments also trigger moderation. We make our code and data publicly available.