Torsten Zesch

2024

pdf abs
Scoring with Confidence? – Exploring High-confidence Scoring for Saving Manual Grading Effort
Marie Bexte | Andrea Horbach | Lena Schützler | Oliver Christ | Torsten Zesch
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

A possible way to save manual grading effort in short answer scoring is to automatically score answers for which the classifier is highly confident. We explore the feasibility of this approach in a high-stakes exam setting, evaluating three different similarity-based scoring methods, where the similarity score is a direct proxy for model confidence. The decision on an appropriate level of confidence should ideally be made before scoring a new prompt. We thus probe to what extent confidence thresholds are consistent across different datasets and prompts. We find that high-confidence thresholds vary on a prompt-to-prompt basis, and that the overall potential of increased performance at a reasonable cost of additional manual effort is limited.

pdf abs
LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches
Imran Chamieh | Torsten Zesch | Klaus Giebermann
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

In this work, we investigate the potential of Large Language Models (LLMs) for automated short answer scoring. We test zero-shot and few-shot settings, and compare with fine-tuned models and a supervised upper-bound, across three diverse datasets. Our results, in zero-shot and few-shot settings, show that LLMs perform poorly in these settings: LLMs have difficulty with tasks that require complex reasoning or domain-specific knowledge. While the models show promise on general knowledge tasks. The fine-tuned model come close to the supervised results but are still not feasible for application, highlighting potential overfitting issues. Overall, our study highlights the challenges and limitations of LLMs in short answer scoring and indicates that there currently seems to be no basis for applying LLMs for short answer scoring.

pdf abs
Rainbow - A Benchmark for Systematic Testing of How Sensitive Visio-Linguistic Models are to Color Naming
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

With the recent emergence of powerful visio-linguistic models comes the question of how fine-grained their multi-modal understanding is. This has lead to the release of several probing datasets. Results point towards models having trouble with prepositions and verbs, but being relatively robust when it comes to color.To gauge how deep this understanding goes, we compile a comprehensive probing dataset to systematically test multi-modal alignment around color. We demonstrate how human perception influences descriptions of color and pay special attention to the extent to which this is reflected within the predictions of a visio-linguistic model. Probing a set of models with diverse properties with our benchmark confirms the superiority of models that do not rely on pre-extracted image features, and demonstrates that augmentation with too much noisy pre-training data can produce an inferior model. While the benchmark remains challenging for all models we test, the overall result pattern suggests well-founded alignment of color terms with hues. Analyses do however reveal uncertainty regarding the boundaries between neighboring color terms.

pdf abs
Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models?
Piush Aggarwal | Jawar Mehrabanian | Weigang Huang | Özge Alacam | Torsten Zesch
Findings of the Association for Computational Linguistics: EACL 2024

This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection, presenting compelling findings. We provide evidence supporting the hypothesis that only the textual component of hateful memes enables the multimodal classifier to generalize across different domains, while the image component proves highly sensitive to a specific training dataset. The evidence includes demonstrations showing that hate-text classifiers perform similarly to hate-meme classifiers in a zero-shot setting. Simultaneously, the introduction of captions generated from images of memes to the hate-meme classifier worsens performance by an average F1 of 0.02. Through blackbox explanations, we identify a substantial contribution of the text modality (average of 83%), which diminishes with the introduction of meme’s image captions (52%). Additionally, our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders with average ∆F1 of 0.18.

pdf abs
Unraveling the Dynamics of Semi-Supervised Hate Speech Detection: The Impact of Unlabeled Data Characteristics and Pseudo-Labeling Strategies
Florian Ludwig | Klara Dolos | Ana Alves-Pinto | Torsten Zesch
Findings of the Association for Computational Linguistics: EACL 2024

Despite advances in machine learning based hate speech detection, the need for larges amounts of labeled training data for state-of-the-art approaches remains a challenge for their application. Semi-supervised learning addresses this problem by leveraging unlabeled data and thus reducing the amount of annotated data required. Underlying this approach is the assumption that labeled and unlabeled data follow similar distributions. This assumption however may not always hold, with consequences for real world applications. We address this problem by investigating the dynamics of pseudo-labeling, a commonly employed form of semi-supervised learning, in the context of hate speech detection. Concretely we analysed the influence of data characteristics and of two strategies for selecting pseudo-labeled samples: threshold- and ratio-based. The results show that the influence of data characteristics on the pseudo-labeling performances depends on other factors, such as pseudo-label selection strategies or model biases. Furthermore, the effectiveness of pseudo-labeling in classification performance is determined by the interaction between the number, hate ratio and accuracy of the selected pseudo-labels. Analysis of the results suggests an advantage of the threshold-based approach when labeled and unlabeled data arise from the same domain, whilst the ratio-based approach may be recommended in the opposite situation.

Recent advances in natural language processing (NLP) can be largely attributed to the advent of pre-trained language models such as BERT and RoBERTa. While these models demonstrate remarkable performance on general datasets, they can struggle in specialized domains such as medicine, where unique domain-specific terminologies, domain-specific abbreviations, and varying document structures are common. This paper explores strategies for adapting these models to domain-specific requirements, primarily through continuous pre-training on domain-specific data. We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data. The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering. Our results suggest that models augmented by clinical and translation-based pre-training typically outperform general domain models in medical contexts. We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch. Furthermore, pre-training on clinical data or leveraging translated texts have proven to be reliable methods for domain adaptation in medical NLP tasks.

pdf abs
Every Verb in Its Right Place? A Roadmap for Operationalizing Developmental Stages in the Acquisition of L2 German
Josef Ruppenhofer | Matthias Schwendemann | Annette Portmann | Katrin Wisniewski | Torsten Zesch
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Developmental stages are a linguistic concept claiming that language learning, despite its large inter-individual variance, generally progresses in an ordered, step-like manner. At the core of research has been the acquisition of verb placement by learners, as conceptualized within Processability Theory (Pienemann, 1989). The computational implementation of a system detecting developmental stages is a prerequisite for an automated analysis of L2 language development. However, such an implementation faces two main challenges. The first is the lack of a fully fleshed out, coherent linguistic specification of the stages. The second concerns the translation of the linguistic specification into computational procedures that can extract clauses from learner-produced text and assign them to a developmental stage based on verb placement. Our contribution provides the necessary linguistic specification of the stages as well as detaiiled discussion and recommendations regarding computational implementation.

pdf abs
EVil-Probe - a Composite Benchmark for Extensive Visio-Linguistic Probing
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Research probing the language comprehension of visio-linguistic models has gained traction due to their remarkable performance on various tasks. We introduce EViL-Probe, a composite benchmark that processes existing probing datasets into a unified format and reorganizes them based on the linguistic categories they probe. On top of the commonly used negative probes, this benchmark introduces positive probes to more rigorously test the robustness of models. Since the language side alone may introduce a bias models could exploit in solving the probes, we estimate the difficulty of the individual subsets with a language-only baseline. Using the benchmark to probe a set of state-of-the-art visio-linguistic models sheds light on how sensitive they are to the different linguistic categories. Results show that the benchmark is challenging for all models we probe, as their performance is around the chance baseline for many of the categories. The only category all models are able to handle relatively well are nouns. Additionally, models that use a Vision Transformer to process the images are also somewhat robust against probes targeting color and image type. Among these models, our enrichment of EViL-Probe with positive probes helps further discriminate performance, showing BLIP to be the overall best-performing model.

2023

pdf abs
Recognizing Learner Handwriting Retaining Orthographic Errors for Enabling Fine-Grained Error Feedback
Christian Gold | Ronja Laarmann-Quante | Torsten Zesch
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper addresses the problem of providing automatic feedback on orthographic errors in handwritten text. Despite the availability of automatic error detection systems, the practical problem of digitizing the handwriting remains. Current handwriting recognition (HWR) systems produce highly accurate transcriptions but normalize away the very errors that are essential for providing useful feedback, e.g. orthographic errors. Our contribution is twofold:First, we create a comprehensive dataset of handwritten text with transcripts retaining orthographic errors by transcribing 1,350 pages from the German learner dataset FD-LEX. Second, we train a simple HWR system on our dataset, allowing it to transcribe words with orthographic errors. Thereby, we evaluate the effect of different dictionaries on recognition output, highlighting the importance of addressing spelling errors in these dictionaries.

pdf abs
Preserving the Authenticity of Handwritten Learner Language: Annotation Guidelines for Creating Transcripts Retaining Orthographic Features
Christian Gold | Ronja Laarmann-quante | Torsten Zesch
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

Handwritten texts produced by young learners often contain orthographic features like spelling errors, capitalization errors, punctuation errors, and impurities such as strikethroughs, inserts, and smudges. All of those are typically normalized or ignored in existing transcriptions. For applications like handwriting recognition with the goal of automatically analyzing a learner’s language performance, however, retaining such features would be necessary. To address this, we present transcription guidelines that retain the features addressed above. Our guidelines were developed iteratively and include numerous example images to illustrate the various issues. On a subset of about 90 double-transcribed texts, we compute inter-annotator agreement and show that our guidelines can be applied with high levels of percentage agreement of about .98. Overall, we transcribed 1,350 learner texts, which is about the same size as the widely adopted handwriting recognition datasets IAM (1,500 pages) and CVL (1,600 pages). Our final corpus can be used to train a handwriting recognition system that transcribes closely to the real productions by young learners. Such a system is a prerequisite for applying automatic orthography feedback systems to handwritten texts in the future.

Automatically scoring student answers is an important task that is usually solved using instance-based supervised learning. Recently, similarity-based scoring has been proposed as an alternative approach yielding similar perfor- mance. It has hypothetical advantages such as a lower need for annotated training data and better zero-shot performance, both of which are properties that would be highly beneficial when applying content scoring in a realistic classroom setting. In this paper we take a closer look at these alleged advantages by comparing different instance-based and similarity-based methods on multiple data sets in a number of learning curve experiments. We find that both the demand on data and cross-prompt performance is similar, thus not confirming the former two suggested advantages. The by default more straightforward possibility to give feedback based on a similarity-based approach may thus tip the scales in favor of it, although future work is needed to explore this advantage in practice.

pdf bib
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Piush Aggarwal | {\"O}zge Ala{\c{c}}am | Carina Silberer | Sina Zarrie{\ss} | Torsten Zesch
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

2022

The dominating paradigm for content scoring is to learn an instance-based model, i.e. to use lexical features derived from the learner answers themselves. An alternative approach that receives much less attention is however to learn a similarity-based model. We introduce an architecture that efficiently learns a similarity model and find that results on the standard ASAP dataset are on par with a BERT-based classification approach.

pdf abs
‘Meet me at the ribary’ – Acceptability of spelling variants in free-text answers to listening comprehension prompts
Ronja Laarmann-Quante | Leska Schwarz | Andrea Horbach | Torsten Zesch
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

When listening comprehension is tested as a free-text production task, a challenge for scoring the answers is the resulting wide range of spelling variants. When judging whether a variant is acceptable or not, human raters perform a complex holistic decision. In this paper, we present a corpus study in which we analyze human acceptability decisions in a high stakes test for German. We show that for human experts, spelling variants are harder to score consistently than other answer variants. Furthermore, we examine how the decision can be operationalized using features that could be applied by an automatic scoring system. We show that simple measures like edit distance and phonetic similarity between a given answer and the target answer can model the human acceptability decisions with the same inter-annotator agreement as humans, and discuss implications of the remaining inconsistencies.

pdf bib
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)
Robin Schaefer | Xiaoyu Bai | Manfred Stede | Torsten Zesch
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

pdf
Bye, Bye, Maintenance Work? Using Model Cloning to Approximate the Behavior of Legacy Tools
Piush Aggarwal | Torsten Zesch
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

pdf abs
LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language
Marie Bexte | Ronja Laarmann-Quante | Andrea Horbach | Torsten Zesch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Spellchecking text written by language learners is especially challenging because errors made by learners differ both quantitatively and qualitatively from errors made by already proficient learners. We introduce LeSpell, a multi-lingual (English, German, Italian, and Czech) evaluation data set of spelling mistakes in context that we compiled from seven underlying learner corpora. Our experiments show that existing spellcheckers do not work well with learner data. Thus, we introduce a highly customizable spellchecking component for the DKPro architecture, which improves performance in many settings.

pdf abs
A Legal Approach to Hate Speech – Operationalizing the EU’s Legal Framework against the Expression of Hatred as an NLP Task
Frederike Zufall | Marius Hamacher | Katharina Kloppenborg | Torsten Zesch
Proceedings of the Natural Legal Language Processing Workshop 2022

We propose a ‘legal approach’ to hate speech detection by operationalization of the decision as to whether a post is subject to criminal law into an NLP task. Comparing existing regulatory regimes for hate speech, we base our investigation on the European Union’s framework as it provides a widely applicable legal minimum standard. Accurately deciding whether a post is punishable or not usually requires legal education. We show that, by breaking the legal assessment down into a series of simpler sub-decisions, even laypersons can annotate consistently. Based on a newly annotated dataset, our experiments show that directly learning an automated model of punishable content is challenging. However, learning the two sub-tasks of ‘target group’ and ‘targeting conduct’ instead of a holistic, end-to-end approach to the legal assessment yields better results. Overall, our method also provides decisions that are more transparent than those of end-to-end models, which is a crucial point in legal decision-making.

pdf
Bringing Automatic Scoring into the Classroom - Measuring the Impact of Automated Analytic Feedback on Student Writing Performance
Andrea Horbach | Ronja Laarmann-Quante | Lucas Liebenow | Thorben Jansen | Stefan Keller | Jennifer Meyer | Torsten Zesch | Johanna Fleckenstein
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

pdf
Evaluating Automatic Spelling Correction Tools on German Primary School Children’s Misspellings
Ronja Laarmann-Quante | Lisa Prepens | Torsten Zesch
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

pdf abs
Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise
Piush Aggarwal | Torsten Zesch
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Hate speech detection systems have been shown to be vulnerable against obfuscation attacks, where a potential hater tries to circumvent detection by deliberately introducing noise in their posts. In previous work, noise is often introduced for all words (which is likely overestimating the impact) or single untargeted words (likely underestimating the vulnerability). We perform a user study asking people to select words they would obfuscate in a post. Using this realistic setting, we find that the real vulnerability of hate speech detection systems against deliberately introduced noise is almost as high as when using a whitebox attack and much more severe than when using a non-targeted dictionary. Our results are based on 4 different datasets, 12 different obfuscation strategies, and hate speech detection systems using different paradigms.

pdf abs
Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation
Florian Ludwig | Klara Dolos | Torsten Zesch | Eleanor Hobley
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

Despite recent advances in machine learning based hate speech detection, classifiers still struggle with generalizing knowledge to out-of-domain data samples. In this paper, we investigate the generalization capabilities of deep learning models to different target groups of hate speech under clean experimental settings. Furthermore, we assess the efficacy of three different strategies of unsupervised domain adaptation to improve these capabilities. Given the diversity of hate and its rapid dynamics in the online world (e.g. the evolution of new target groups like virologists during the COVID-19 pandemic), robustly detecting hate aimed at newly identified target groups is a highly relevant research question. We show that naively trained models suffer from a target group specific bias, which can be reduced via domain adaptation. We were able to achieve a relative improvement of the F1-score between 5.8% and 10.7% for out-of-domain target groups of hate speech compared to baseline approaches by utilizing domain adaptation.

2021

pdf abs
C-Test Collector: A Proficiency Testing Application to Collect Training Data for C-Tests
Christian Haring | Rene Lehmann | Andrea Horbach | Torsten Zesch
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

We present the C-Test Collector, a web-based tool that allows language learners to test their proficiency level using c-tests. Our tool collects anonymized data on test performance, which allows teachers to gain insights into common error patterns. At the same time, it allows NLP researchers to collect training data for being able to generate c-test variants at the desired difficulty level.

pdf bib
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)
Kilian Evang | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf
Robustness of end-to-end Automatic Speech Recognition Models – A Case Study using Mozilla DeepSpeech
Aashish Agarwal | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf
Effects of Layer Freezing on Transferring a Speech Recognition System to Under-resourced Languages
Onno Eberhard | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf abs
A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich | Torsten Zesch
Proceedings of the Fifth Workshop on Teaching NLP

It is generally agreed upon in the natural language processing (NLP) community that ethics should be integrated into any curriculum. Being aware of and understanding the relevant core concepts is a prerequisite for following and participating in the discourse on ethical NLP. We here present ready-made teaching material in the form of slides and practical exercises on ethical issues in NLP, which is primarily intended to be integrated into introductory NLP or computational linguistics courses. By making this material freely available, we aim at lowering the threshold to adding ethics to the curriculum. We hope that increased awareness will enable students to identify potentially unethical behavior.

pdf bib abs
Implicit Phenomena in Short-answer Scoring Data
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

Short-answer scoring is the task of assessing the correctness of a short text given as response to a question that can come from a variety of educational scenarios. As only content, not form, is important, the exact wording including the explicitness of an answer should not matter. However, many state-of-the-art scoring models heavily rely on lexical information, be it word embeddings in a neural network or n-grams in an SVM. Thus, the exact wording of an answer might very well make a difference. We therefore quantify to what extent implicit language phenomena occur in short answer datasets and examine the influence they have on automatic scoring performance. We find that the level of implicitness depends on the individual question, and that some phenomena are very frequent. Resolving implicit wording to explicit formulations indeed tends to improve automatic scoring performance.

pdf abs
VL-BERT+: Detecting Protected Groups in Hateful Multimodal Memes
Piush Aggarwal | Michelle Espranita Liman | Darina Gold | Torsten Zesch
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

This paper describes our submission (winning solution for Task A) to the Shared Task on Hateful Meme Detection at WOAH 2021. We build our system on top of a state-of-the-art system for binary hateful meme classification that already uses image tags such as race, gender, and web entities. We add further metadata such as emotions and experiment with data augmentation techniques, as hateful instances are underrepresented in the data set.

2020

pdf abs
Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels
Yuning Ding | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.

pdf bib
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf abs
Don’t take “nswvtnvakgxpm” for an answer –The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.

pdf abs
Decomposing and Comparing Meaning Relations: Paraphrasing, Textual Entailment, Contradiction, and Specificity
Venelin Kovatchev | Darina Gold | M. Antonia Marti | Maria Salamo | Torsten Zesch
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a methodology for decomposing and comparing multiple meaning relations (paraphrasing, textual entailment, contradiction, and specificity). The methodology includes SHARel - a new typology that consists of 26 linguistic and 8 reason-based categories. We use the typology to annotate a corpus of 520 sentence pairs in English and we demonstrate that unlike previous typologies, SHARel can be applied to all relations of interest with a high inter-annotator agreement. We analyze and compare the frequency and distribution of the linguistic and reason-based phenomena involved in paraphrasing, textual entailment, contradiction, and specificity. This comparison allows for a much more in-depth analysis of the workings of the individual relations and the way they interact and compare with each other. We release all resources (typology, annotation guidelines, and annotated corpus) to the community.

2019

pdf abs
From legal to technical concept: Towards an automated classification of German political Twitter postings as criminal offenses
Frederike Zufall | Tobias Horsmann | Torsten Zesch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Advances in the automated detection of offensive Internet postings make this mechanism very attractive to social media companies, who are increasingly under pressure to monitor and action activity on their sites. However, these advances also have important implications as a threat to the fundamental right of free expression. In this article, we analyze which Twitter posts could actually be deemed offenses under German criminal law. German law follows the deductive method of the Roman law tradition based on abstract rules as opposed to the inductive reasoning in Anglo-American common law systems. This allows us to show how legal conclusions can be reached and implemented without relying on existing court decisions. We present a data annotation schema, consisting of a series of binary decisions, for determining whether a specific post would constitute a criminal offense. This schema serves as a step towards an inexpensive creation of a sufficient amount of data for an automated classification. We find that the majority of posts deemed offensive actually do not constitute a criminal offense and still contribute to public discourse. Furthermore, laymen can provide sufficiently reliable data to an expert reference but are, for instance, more lenient in the interpretation of what constitutes a disparaging statement.

pdf abs
Divide and Extract – Disentangling Clause Splitting and Proposition Extraction
Darina Gold | Torsten Zesch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Proposition extraction from sentences is an important task for information extraction systems Evaluation of such systems usually conflates two aspects: splitting complex sentences into clauses and the extraction of propositions. It is thus difficult to independently determine the quality of the proposition extraction step. We create a manually annotated proposition dataset from sentences taken from restaurant reviews that distinguishes between clauses that need to be split and those that do not. The resulting proposition evaluation dataset allows us to independently compare the performance of proposition extraction systems on simple and complex clauses. Although performance drastically drops on more complex sentences, we show that the same systems perform best on both simple and complex clauses. Furthermore, we show that specific kinds of subordinate clauses pose difficulties to most systems.

pdf abs
ltl.uni-due at SemEval-2019 Task 5: Simple but Effective Lexico-Semantic Features for Detecting Hate Speech in Twitter
Huangpan Zhang | Michael Wojatzki | Tobias Horsmann | Torsten Zesch
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our contribution to SemEval 2019 Task 5 Multilingual Detection of Hate, specifically in the Subtask A (English and Spanish). We compare different configurations of shallow and deep learning approaches on the English data and use the system that performs best in both sub-tasks. The resulting SVM-based system with lexicosemantic features (n-grams and embeddings) is ranked 23rd out of 69 on the English data and beats the baseline system. On the Spanish data our system is ranked 25th out of 39.

pdf abs
LTL-UDE at SemEval-2019 Task 6: BERT and Two-Vote Classification for Categorizing Offensiveness
Piush Aggarwal | Tobias Horsmann | Michael Wojatzki | Torsten Zesch
Proceedings of the 13th International Workshop on Semantic Evaluation

We present results for Subtask A and C of SemEval 2019 Shared Task 6. In Subtask A, we experiment with an embedding representation of postings and use BERT to categorize postings. Our best result reaches the 10th place (out of 103). In Subtask C, we applied a two-vote classification approach with minority fallback, which is placed on the 19th rank (out of 65).

pdf bib
RELATIONS - Workshop on meaning relations between phrases and sentences
Venelin Kovatchev | Darina Gold | Torsten Zesch
RELATIONS - Workshop on meaning relations between phrases and sentences

pdf abs
Annotating and analyzing the interactions between meaning relations
Darina Gold | Venelin Kovatchev | Torsten Zesch
Proceedings of the 13th Linguistic Annotation Workshop

Pairs of sentences, phrases, or other text pieces can hold semantic relations such as paraphrasing, textual entailment, contradiction, specificity, and semantic similarity. These relations are usually studied in isolation and no dataset exists where they can be compared empirically. Here we present a corpus annotated with these relations and the analysis of these results. The corpus contains 520 sentence pairs, annotated with these relations. We measure the annotation reliability of each individual relation and we examine their interactions and correlations. Among the unexpected results revealed by our analysis is that the traditionally considered direct relationship between paraphrasing and bi-directional entailment does not hold in our data.

pdf bib
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Helen Yannakoudakis | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Torsten Zesch
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

2018

pdf
Quantifying Qualitative Data for Understanding Controversial Issues
Michael Wojatzki | Saif Mohammad | Torsten Zesch | Svetlana Kiritchenko
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
ESCRITO - An NLP-Enhanced Educational Scoring Toolkit
Torsten Zesch | Andrea Horbach
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
DeepTC – An Extension of DKPro Text Classification for Fostering Reproducibility of Deep Learning Experiments
Tobias Horsmann | Torsten Zesch
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
Agree or Disagree: Predicting Judgments on Nuanced Assertions
Michael Wojatzki | Torsten Zesch | Saif Mohammad | Svetlana Kiritchenko
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Being able to predict whether people agree or disagree with an assertion (i.e. an explicit, self-contained statement) has several applications ranging from predicting how many people will like or dislike a social media post to classifying posts based on whether they are in accordance with a particular point of view. We formalize this as two NLP tasks: predicting judgments of (i) individuals and (ii) groups based on the text of the assertion and previous judgments. We evaluate a wide range of approaches on a crowdsourced data set containing over 100,000 judgments on over 2,000 assertions. We find that predicting individual judgments is a hard task with our best results only slightly exceeding a majority baseline, but that judgments of groups can be more reliably predicted using a Siamese neural network, which outperforms all other approaches by a wide margin.

pdf abs
Cross-Lingual Content Scoring
Andrea Horbach | Sebastian Stennmanns | Torsten Zesch
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We investigate the feasibility of cross-lingual content scoring, a scenario where training and test data in an automatic scoring task are from two different languages. Cross-lingual scoring can contribute to educational equality by allowing answers in multiple languages. Training a model in one language and applying it to another language might also help to overcome data sparsity issues by re-using trained models from other languages. As there is no suitable dataset available for this new task, we create a comparable bi-lingual corpus by extending the English ASAP dataset with German answers. Our experiments with cross-lingual scoring based on machine-translating either training or test data show a considerable drop in scoring quality.

pdf
The Role of Diacritics in Increasing the Difficulty of Arabic Lexical Recognition Tests
Osama Hamed | Torsten Zesch
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning

2017

pdf abs
Do LSTMs really work so well for PoS tagging? – A replication study
Tobias Horsmann | Torsten Zesch
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state-of-the-art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora of 21 languages that are annotated with fine-grained tagsets of varying size. Our replication confirms the result in general, and we additionally find that the advantage of LSTMs is even bigger for larger tagsets. However, we also find that for the very large tagsets of morphologically rich languages, hand-crafted morphological lexicons are still necessary to reach state-of-the-art performance.

pdf abs
Same same, but different: Compositionality of paraphrase granularity levels
Darina Benikova | Torsten Zesch
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Paraphrases exist on different granularity levels, the most frequently used one being the sentential level. However, we argue that working on the sentential level is not optimal for both machines and humans, and that it would be easier and more efficient to work on sub-sentential levels. To prove this, we quantify and analyze the difference between paraphrases on both sentence and sub-sentence level in order to show the significance of the problem. First results on a preliminary dataset seem to confirm our hypotheses.

pdf abs
Investigating neural architectures for short answer scoring
Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – ngrams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.

pdf abs
Fine-grained essay scoring of a complex writing task for native speakers
Andrea Horbach | Dirk Scholten-Akoun | Yuning Ding | Torsten Zesch
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Automatic essay scoring is nowadays successfully used even in high-stakes tests, but this is mainly limited to holistic scoring of learner essays. We present a new dataset of essays written by highly proficient German native speakers that is scored using a fine-grained rubric with the goal to provide detailed feedback. Our experiments with two state-of-the-art scoring systems (a neural and a SVM-based one) show a large drop in performance compared to existing datasets. This demonstrates the need for such datasets that allow to guide research on more elaborate essay scoring methods.

pdf abs
The Influence of Spelling Errors on Content Scoring Performance
Andrea Horbach | Yuning Ding | Torsten Zesch
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the ASAP corpus. We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these finding in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency in ASAP.

2016

pdf abs
Assigning Fine-grained PoS Tags based on High-precision Coarse-grained Tagging
Tobias Horsmann | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We propose a new approach to PoS tagging where in a first step, we assign a coarse-grained tag corresponding to the main syntactic category. Based on this high-precision decision, in the second step we utilize specially trained fine-grained models with heavily reduced decision complexity. By analyzing the system under oracle conditions, we show that there is a quite large potential for significantly outperforming a competitive baseline. When we take error-propagation from the coarse-grained tagging into account, our approach is on par with the state of the art. Our approach also allows tailoring the tagger towards recognizing single word classes which are of interest e.g. for researchers searching for specific phenomena in large corpora. In a case study, we significantly outperform a standard model that also makes use of the same optimizations.

pdf abs
Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks
Ildikó Pilán | Elena Volodina | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The lack of a sufficient amount of data tailored for a task is a well-recognized problem for many statistical NLP methods. In this paper, we explore whether data sparsity can be successfully tackled when classifying language proficiency levels in the domain of learner-written output texts. We aim at overcoming data sparsity by incorporating knowledge in the trained model from another domain consisting of input texts written by teaching professionals for learners. We compare different domain adaptation techniques and find that a weighted combination of the two types of data performs best, which can even rival systems based on considerably larger amounts of in-domain data. Moreover, we show that normalizing errors in learners’ texts can substantially improve classification when level-annotated in-domain data is not available.

pdf abs
FlexTag: A Highly Flexible PoS Tagging Framework
Torsten Zesch | Tobias Horsmann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present FlexTag, a highly flexible PoS tagging framework. In contrast to monolithic implementations that can only be retrained but not adapted otherwise, FlexTag enables users to modify the feature space and the classification algorithm. Thus, FlexTag makes it easy to quickly develop custom-made taggers exactly fitting the research problem.

pdf bib
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Steven Bethard | Marine Carpuat | Daniel Cer | David Jurgens | Preslav Nakov | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
ltl.uni-due at SemEval-2016 Task 6: Stance Detection in Social Media Using Stacked Classifiers
Michael Wojatzki | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
Predicting the Spelling Difficulty of Words for Language Learners
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises
Michael Wojatzki | Oren Melamud | Torsten Zesch
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf
LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
Tobias Horsmann | Torsten Zesch
Proceedings of the 10th Web as Corpus Workshop

pdf bib
Bridging the gap between computable and expressive event representations in Social Media
Darina Benikova | Torsten Zesch
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

pdf
Validating bundled gap filling – Empirical evidence for ambiguity reduction and language proficiency testing capabilities
Niklas Meyer | Michael Wojatzki | Torsten Zesch
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

2015

pdf bib
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
Preslav Nakov | Torsten Zesch | Daniel Cer | David Jurgens
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Candidate evaluation strategies for improved difficulty prediction of language tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Reducing Annotation Efforts in Supervised Short Answer Scoring
Torsten Zesch | Michael Heilman | Aoife Cahill
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Task-Independent Features for Automated Essay Grading
Torsten Zesch | Michael Wojatzki | Dirk Scholten-Akoun
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Counting What Counts: Decompounding for Keyphrase Extraction
Nicolai Erbs | Pedro Bispo Santos | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

2014

pdf
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
Nicolai Erbs | Pedro Bispo Santos | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
Johannes Daxenberger | Oliver Ferschke | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf abs
Predicting the Difficulty of Language Proficiency Tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 2

Language proficiency tests are used to evaluate and compare the progress of language learners. We present an approach for automatic difficulty prediction of C-tests that performs on par with human experts. On the basis of detailed analysis of newly collected data, we develop a model for C-test difficulty introducing four dimensions: solution difficulty, candidate ambiguity, inter-gap dependency, and paragraph difficulty. We show that cues from all four dimensions contribute to C-test difficulty.

pdf
Sense and Similarity: A Study of Sense-level Similarity Measures
Nicolai Erbs | Iryna Gurevych | Torsten Zesch
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
Preslav Nakov | Torsten Zesch
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf
Automatic Generation of Challenging Distractors Using Context-Sensitive Inference Rules
Torsten Zesch | Oren Melamud
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Towards Automatic Scoring of Cloze Items by Selecting Low-Ambiguity Contexts
Tobias Horsmann | Torsten Zesch
Proceedings of the third workshop on NLP for computer-assisted language learning

Wikipedia has been used as a knowledge source in many areas of natural language processing. As most studies only use a certain Wikipedia snapshot, the influence of Wikipedias massive growth on the results is largely unknown. For the first time, we perform an in-depth analysis of this influence using semantic relatedness as an example application that tests a wide range of Wikipedias properties. We find that the growth of Wikipedia has almost no effect on the correlation of semantic relatedness measures with human judgments, while the coverage steadily increases.

pdf bib
Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources
Iryna Gurevych | Torsten Zesch
Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources

2009

pdf
Approximate Matching for Evaluating Keyphrase Extraction
Torsten Zesch | Iryna Gurevych
Proceedings of the International Conference RANLP-2009

pdf bib
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)
Iryna Gurevych | Torsten Zesch
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

2008

pdf abs
Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch | Christof Müller | Iryna Gurevych
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes.