Stefan Dietze


2025

Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments
Marc Feger | Katarina Boland | Stefan Dietze
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on 17 English sentence-level datasets, the granularity most relevant to the task. Our findings show that, to varying degrees, these models tend to rely on lexical shortcuts tied to content words, suggesting that apparent progress may often be driven by dataset-specific cues rather than true task alignment. While the models achieve strong results on familiar benchmarks, their performance drops markedly when applied to unseen datasets. Nonetheless, incorporating both task-specific pre-training and joint benchmark training proves effective in enhancing both robustness and generalization.
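
The core finding, strong in-domain scores that collapse on unseen datasets, can be illustrated with a leave-one-dataset-out evaluation. The sketch below is not the authors' pipeline: it uses a deliberately lexical TF-IDF baseline and toy stand-ins for the 17 benchmarks, simply to show how in-domain and cross-dataset macro-F1 are contrasted.

```python
# Minimal sketch (not the paper's setup): leave-one-dataset-out
# evaluation of a deliberately lexical argument classifier. The toy
# data below stands in for the 17 sentence-level benchmarks.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-ins: {dataset name: (sentences, labels)}, 1 = argument.
datasets = {
    "debates": (["Nuclear power should be banned because it is risky.",
                 "The hearing starts at noon."], [1, 0]),
    "forums":  (["Vaccines are safe, so mandates are justified.",
                 "Anyone know a good pizza place?"], [1, 0]),
}

for held_out, (test_x, test_y) in datasets.items():
    # Train jointly on every dataset except the held-out one.
    train_x = [s for n, (xs, _) in datasets.items() if n != held_out for s in xs]
    train_y = [y for n, (_, ys) in datasets.items() if n != held_out for y in ys]

    vec = TfidfVectorizer(ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_x), train_y)

    # Cross-dataset macro-F1 drops sharply if the classifier has
    # latched onto dataset-specific content words.
    pred = clf.predict(vec.transform(test_x))
    print(held_out, f1_score(test_y, pred, average="macro"))
```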

SOMD2025: A Challenging Shared Task for Software Related Information Extraction
Sharmila Upadhyaya | Wolfgang Otto | Frank Krüger | Stefan Dietze
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

The use of software in acquiring, analyzing, and interpreting research data underscores its role as an essential artifact of scientific inquiry. Understanding and tracing the provenance of software in research supports reproducible and collaborative research. In this paper, we present an overview of the second iteration of the Software Mention Detection (SOMD) shared task, part of the Scholarly Document Processing (SDP) workshop held in conjunction with ACL 2025. The task encourages participants to develop optimized methods for software mention detection, as well as for the additional attribute and relation extraction tasks, on the provided gold-standard benchmark. The shared task comprises two phases: in Phase I, participants implement a joint framework for named entity recognition (NER) and relation extraction (RE) on the given dataset, while Phase II introduces an out-of-distribution dataset to evaluate the generalizability of the methods proposed in Phase I. The competition, held over two months (March-April 2025), attracted 18 participants; four teams completed it and submitted full system descriptions. Participants applied various approaches, including joint and pipeline models, and explored data augmentation with LLM-generated samples. Evaluation was based on the macro-F1 score for both NER and RE, with their average reported as the SOMD-score. The winning teams achieved a SOMD-score of 0.89 in Phase I and 0.63 in Phase II, demonstrating the challenge of generalization.
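
A minimal sketch of the aggregation described above: the macro-F1 of each subtask, averaged into the SOMD-score. The label names below are hypothetical and the alignment of gold and predicted instances is simplified; the organizers' official evaluation script may differ in detail.

```python
# Minimal sketch of the SOMD-score: macro-F1 of NER and RE, averaged.
# Label names are hypothetical; instance alignment is simplified.
from sklearn.metrics import f1_score

# Hypothetical per-token gold/predicted NER tags and per-pair RE labels.
ner_gold = ["B-software", "O", "B-version", "O"]
ner_pred = ["B-software", "O", "O", "O"]
re_gold  = ["version_of", "developed_by", "no_relation"]
re_pred  = ["version_of", "no_relation", "no_relation"]

ner_f1 = f1_score(ner_gold, ner_pred, average="macro")
re_f1  = f1_score(re_gold, re_pred, average="macro")
somd_score = (ner_f1 + re_f1) / 2
print(f"NER={ner_f1:.2f}  RE={re_f1:.2f}  SOMD-score={somd_score:.2f}")
```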

2024

BERTweet’s TACO Fiesta: Contrasting Flavors On The Path Of Inference And Information-Driven Argument Mining On Twitter
Marc Feger | Stefan Dietze
Findings of the Association for Computational Linguistics: NAACL 2024

TACO – Twitter Arguments from COnversations
Marc Feger | Stefan Dietze
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Twitter has emerged as a global hub for engaging in online conversations and as a research corpus for various disciplines that have recognized the significance of its user-generated content. Argument mining is an important analytical task for processing and understanding online discourse. Specifically, it aims to identify the structural elements of arguments, denoted as information and inference. These elements, however, are not static and may require context from the conversation they appear in, yet there is a lack of data and annotation frameworks addressing this dynamic aspect on Twitter. We contribute TACO, the first dataset of Twitter Arguments, comprising 1,814 tweets that cover 200 entire COnversations across six heterogeneous topics, annotated with an agreement of 0.718 Krippendorff's α among six experts. In addition, we provide our annotation framework, which incorporates definitions from the Cambridge Dictionary, to define and identify argument components on Twitter. Our transformer-based classifier achieves a macro F1 baseline score of 85.06% in detecting arguments. Moreover, our data reveals that Twitter users tend to engage in discussions involving informed inferences and information. TACO serves multiple purposes, such as training tweet classifiers to manage tweets based on inference and information elements, while also providing valuable insights into the conversational reply patterns of tweets.
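
A minimal sketch of how an inter-annotator agreement figure like the 0.718 Krippendorff's α can be computed, using the third-party krippendorff package; the annotation matrix below is toy data, not the actual TACO annotations.

```python
# Minimal sketch: Krippendorff's alpha for nominal labels with the
# third-party `krippendorff` package (pip install krippendorff).
# The matrix below is toy data, not the TACO annotations.
import numpy as np
import krippendorff

# Rows = annotators, columns = tweets; values are nominal class codes
# (e.g. 0 = none, 1 = information, 2 = inference), NaN = not annotated.
reliability_data = np.array([
    [1, 2, 2, 0, 1, np.nan],
    [1, 2, 1, 0, 1, 2],
    [1, 2, 2, 0, np.nan, 2],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```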

Dissecting Paraphrases: The Impact of Prompt Syntax and Supplementary Information on Knowledge Retrieval from Pretrained Language Models
Stephan Linzbach | Dimitar Dimitrov | Laura Kallmeyer | Kilian Evang | Hajira Jabeen | Stefan Dietze
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Pre-trained Language Models (PLMs) are known to contain various kinds of knowledge. One method to infer relational knowledge is through the use of cloze-style prompts, where a model is tasked to predict missing subjects or objects. Typically, designing these prompts is a tedious task because small differences in syntax or semantics can have a substantial impact on knowledge retrieval performance. Simultaneously, evaluating the impact of either prompt syntax or information is challenging due to their interdependence. We designed CONPARE-LAMA, a dedicated probe consisting of 34 million distinct prompts that facilitate comparison across minimal paraphrases. These paraphrases follow a unified meta-template enabling the controlled variation of syntax and semantics across arbitrary relations. CONPARE-LAMA enables insights into the independent impact of either the syntactic form or the semantic information of paraphrases on the knowledge retrieval performance of PLMs. Extensive knowledge retrieval experiments using our probe reveal that prompts following clausal syntax have several desirable properties in comparison to appositive syntax: i) they are more useful when querying PLMs with a combination of supplementary information, ii) knowledge is more consistently recalled across different combinations of supplementary information, and iii) they decrease response uncertainty when retrieving known facts. In addition, range information can boost knowledge retrieval performance more than domain information, even though domain information is more reliably helpful across syntactic forms.
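
A minimal sketch of cloze-style knowledge probing as described above, using the Hugging Face fill-mask pipeline. The two prompts contrast a clausal and an appositive paraphrase of the same fact; they are illustrative examples and not drawn from the CONPARE-LAMA probe itself.

```python
# Minimal sketch of cloze-style probing with a masked LM via the
# Hugging Face fill-mask pipeline. The paraphrase pair below contrasts
# clausal vs. appositive syntax for the same fact; it is illustrative
# and not taken from the CONPARE-LAMA probe.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

prompts = {
    "clausal":    "Douglas Adams, who was born in [MASK], is a famous writer.",
    "appositive": "Douglas Adams, born in [MASK], is a famous writer.",
}

for syntax, prompt in prompts.items():
    top = fill(prompt, top_k=3)
    print(syntax, [(p["token_str"], round(p["score"], 3)) for p in top])
```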

2023

GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets
Wolfgang Otto | Matthäus Zloch | Lu Gan | Saurav Karmakar | Stefan Dietze
Findings of the Association for Computational Linguistics: EMNLP 2023

Named Entity Recognition (NER) models play a crucial role in various NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning models and datasets are fundamental components of various computer science publications and necessitate accurate models for identification. Despite the advancements in NER, existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types, and consequently, baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. In order to provide a nuanced understanding of how ML models and datasets are mentioned and utilized, our dataset also contains annotations for informal mentions like “our BERT-based model” or “an image CNN”. You can find the ground truth dataset and code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
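
A minimal sketch of how fine-grained scholarly entities, including informal mentions like "our BERT-based model", can be represented and scored in a BIO tagging scheme, using the third-party seqeval package. The tag names below (MLModel, Dataset) are illustrative and may differ from the corpus's exact 10 entity types.

```python
# Minimal sketch: scoring BIO-tagged scholarly entity predictions with
# the third-party `seqeval` package (pip install seqeval). Tag names
# are illustrative, not necessarily the corpus's exact 10 types.
from seqeval.metrics import classification_report, f1_score

# One sentence: "We fine-tune our BERT-based model on ImageNet ."
# The informal mention "our BERT-based model" is annotated as MLModel.
gold = [["O", "O", "B-MLModel", "I-MLModel", "I-MLModel", "O", "B-Dataset", "O"]]
pred = [["O", "O", "B-MLModel", "I-MLModel", "I-MLModel", "O", "O", "O"]]

print(classification_report(gold, pred))
print("entity-level F1:", f1_score(gold, pred))
```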