Torsten Zesch

2022

The dominating paradigm for content scoring is to learn an instance-based model, i.e. to use lexical features derived from the learner answers themselves. An alternative approach that receives much less attention is however to learn a similarity-based model. We introduce an architecture that efficiently learns a similarity model and find that results on the standard ASAP dataset are on par with a BERT-based classification approach.

pdf abs
‘Meet me at the ribary’ – Acceptability of spelling variants in free-text answers to listening comprehension prompts
Ronja Laarmann-Quante | Leska Schwarz | Andrea Horbach | Torsten Zesch
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

When listening comprehension is tested as a free-text production task, a challenge for scoring the answers is the resulting wide range of spelling variants. When judging whether a variant is acceptable or not, human raters perform a complex holistic decision. In this paper, we present a corpus study in which we analyze human acceptability decisions in a high stakes test for German. We show that for human experts, spelling variants are harder to score consistently than other answer variants.Furthermore, we examine how the decision can be operationalized using features that could be applied by an automatic scoring system. We show that simple measures like edit distance and phonetic similarity between a given answer and the target answer can model the human acceptability decisions with the same inter-annotator agreement as humans, and discuss implications of the remaining inconsistencies.

pdf abs
LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language
Marie Bexte | Ronja Laarmann-Quante | Andrea Horbach | Torsten Zesch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Spellchecking text written by language learners is especially challenging because errors made by learners differ both quantitatively and qualitatively from errors made by already proficient learners. We introduce LeSpell, a multi-lingual (English, German, Italian, and Czech) evaluation data set of spelling mistakes in context that we compiled from seven underlying learner corpora. Our experiments show that existing spellcheckers do not work well with learner data. Thus, we introduce a highly customizable spellchecking component for the DKPro architecture, which improves performance in many settings.

pdf
Bringing Automatic Scoring into the Classroom - Measuring the Impact of Automated Analytic Feedback on Student Writing Performance
Andrea Horbach | Ronja Laarmann-Quante | Lucas Liebenow | Thorben Jansen | Stefan Keller | Jennifer Meyer | Torsten Zesch | Johanna Fleckenstein
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

pdf
Evaluating Automatic Spelling Correction Tools on German Primary School Children’s Misspellings
Ronja Laarmann-Quante | Lisa Prepens | Torsten Zesch
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

pdf abs
Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation
Florian Ludwig | Klara Dolos | Torsten Zesch | Eleanor Hobley
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

Despite recent advances in machine learning based hate speech detection, classifiers still struggle with generalizing knowledge to out-of-domain data samples. In this paper, we investigate the generalization capabilities of deep learning models to different target groups of hate speech under clean experimental settings. Furthermore, we assess the efficacy of three different strategies of unsupervised domain adaptation to improve these capabilities. Given the diversity of hate and its rapid dynamics in the online world (e.g. the evolution of new target groups like virologists during the COVID-19 pandemic), robustly detecting hate aimed at newly identified target groups is a highly relevant research question. We show that naively trained models suffer from a target group specific bias, which can be reduced via domain adaptation. We were able to achieve a relative improvement of the F1-score between 5.8% and 10.7% for out-of-domain target groups of hate speech compared to baseline approaches by utilizing domain adaptation.

pdf abs
Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise
Piush Aggarwal | Torsten Zesch
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Hate speech detection systems have been shown to be vulnerable against obfuscation attacks, where a potential hater tries to circumvent detection by deliberately introducing noise in their posts. In previous work, noise is often introduced for all words (which is likely overestimating the impact) or single untargeted words (likely underestimating the vulnerability). We perform a user study asking people to select words they would obfuscate in a post. Using this realistic setting, we find that the real vulnerability of hate speech detection systems against deliberately introduced noise is almost as high as when using a whitebox attack and much more severe than when using a non-targeted dictionary. Our results are based on 4 different datasets, 12 different obfuscation strategies, and hate speech detection systems using different paradigms.

pdf abs
A Legal Approach to Hate Speech – Operationalizing the EU’s Legal Framework against the Expression of Hatred as an NLP Task
Frederike Zufall | Marius Hamacher | Katharina Kloppenborg | Torsten Zesch
Proceedings of the Natural Legal Language Processing Workshop 2022

We propose a ‘legal approach’ to hate speech detection by operationalization of the decision as to whether a post is subject to criminal law into an NLP task. Comparing existing regulatory regimes for hate speech, we base our investigation on the European Union’s framework as it provides a widely applicable legal minimum standard. Accurately deciding whether a post is punishable or not usually requires legal education. We show that, by breaking the legal assessment down into a series of simpler sub-decisions, even laypersons can annotate consistently. Based on a newly annotated dataset, our experiments show that directly learning an automated model of punishable content is challenging. However, learning the two sub-tasks of ‘target group’ and ‘targeting conduct’ instead of a holistic, end-to-end approach to the legal assessment yields better results. Overall, our method also provides decisions that are more transparent than those of end-to-end models, which is a crucial point in legal decision-making.

pdf bib
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)
Robin Schaefer | Xiaoyu Bai | Manfred Stede | Torsten Zesch
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

pdf
Bye, Bye, Maintenance Work? Using Model Cloning to Approximate the Behavior of Legacy Tools
Piush Aggarwal | Torsten Zesch
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

2021

pdf bib abs
Implicit Phenomena in Short-answer Scoring Data
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

Short-answer scoring is the task of assessing the correctness of a short text given as response to a question that can come from a variety of educational scenarios. As only content, not form, is important, the exact wording including the explicitness of an answer should not matter. However, many state-of-the-art scoring models heavily rely on lexical information, be it word embeddings in a neural network or n-grams in an SVM. Thus, the exact wording of an answer might very well make a difference. We therefore quantify to what extent implicit language phenomena occur in short answer datasets and examine the influence they have on automatic scoring performance. We find that the level of implicitness depends on the individual question, and that some phenomena are very frequent. Resolving implicit wording to explicit formulations indeed tends to improve automatic scoring performance.

pdf abs
C-Test Collector: A Proficiency Testing Application to Collect Training Data for C-Tests
Christian Haring | Rene Lehmann | Andrea Horbach | Torsten Zesch
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

We present the C-Test Collector, a web-based tool that allows language learners to test their proficiency level using c-tests. Our tool collects anonymized data on test performance, which allows teachers to gain insights into common error patterns. At the same time, it allows NLP researchers to collect training data for being able to generate c-test variants at the desired difficulty level.

pdf bib
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)
Kilian Evang | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf
Robustness of end-to-end Automatic Speech Recognition Models – A Case Study using Mozilla DeepSpeech
Aashish Agarwal | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf
Effects of Layer Freezing on Transferring a Speech Recognition System to Under-resourced Languages
Onno Eberhard | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf abs
A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich | Torsten Zesch
Proceedings of the Fifth Workshop on Teaching NLP

It is generally agreed upon in the natural language processing (NLP) community that ethics should be integrated into any curriculum. Being aware of and understanding the relevant core concepts is a prerequisite for following and participating in the discourse on ethical NLP. We here present ready-made teaching material in the form of slides and practical exercises on ethical issues in NLP, which is primarily intended to be integrated into introductory NLP or computational linguistics courses. By making this material freely available, we aim at lowering the threshold to adding ethics to the curriculum. We hope that increased awareness will enable students to identify potentially unethical behavior.

pdf abs
VL-BERT+: Detecting Protected Groups in Hateful Multimodal Memes
Piush Aggarwal | Michelle Espranita Liman | Darina Gold | Torsten Zesch
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

This paper describes our submission (winning solution for Task A) to the Shared Task on Hateful Meme Detection at WOAH 2021. We build our system on top of a state-of-the-art system for binary hateful meme classification that already uses image tags such as race, gender, and web entities. We add further metadata such as emotions and experiment with data augmentation techniques, as hateful instances are underrepresented in the data set.

2020

pdf abs
Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels
Yuning Ding | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.

pdf bib
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf abs
Decomposing and Comparing Meaning Relations: Paraphrasing, Textual Entailment, Contradiction, and Specificity
Venelin Kovatchev | Darina Gold | M. Antonia Marti | Maria Salamo | Torsten Zesch
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a methodology for decomposing and comparing multiple meaning relations (paraphrasing, textual entailment, contradiction, and specificity). The methodology includes SHARel - a new typology that consists of 26 linguistic and 8 reason-based categories. We use the typology to annotate a corpus of 520 sentence pairs in English and we demonstrate that unlike previous typologies, SHARel can be applied to all relations of interest with a high inter-annotator agreement. We analyze and compare the frequency and distribution of the linguistic and reason-based phenomena involved in paraphrasing, textual entailment, contradiction, and specificity. This comparison allows for a much more in-depth analysis of the workings of the individual relations and the way they interact and compare with each other. We release all resources (typology, annotation guidelines, and annotated corpus) to the community.

pdf abs
Don’t take “nswvtnvakgxpm” for an answer –The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.

2019

pdf abs
From legal to technical concept: Towards an automated classification of German political Twitter postings as criminal offenses
Frederike Zufall | Tobias Horsmann | Torsten Zesch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Advances in the automated detection of offensive Internet postings make this mechanism very attractive to social media companies, who are increasingly under pressure to monitor and action activity on their sites. However, these advances also have important implications as a threat to the fundamental right of free expression. In this article, we analyze which Twitter posts could actually be deemed offenses under German criminal law. German law follows the deductive method of the Roman law tradition based on abstract rules as opposed to the inductive reasoning in Anglo-American common law systems. This allows us to show how legal conclusions can be reached and implemented without relying on existing court decisions. We present a data annotation schema, consisting of a series of binary decisions, for determining whether a specific post would constitute a criminal offense. This schema serves as a step towards an inexpensive creation of a sufficient amount of data for an automated classification. We find that the majority of posts deemed offensive actually do not constitute a criminal offense and still contribute to public discourse. Furthermore, laymen can provide sufficiently reliable data to an expert reference but are, for instance, more lenient in the interpretation of what constitutes a disparaging statement.

pdf abs
Divide and Extract – Disentangling Clause Splitting and Proposition Extraction
Darina Gold | Torsten Zesch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Proposition extraction from sentences is an important task for information extraction systems Evaluation of such systems usually conflates two aspects: splitting complex sentences into clauses and the extraction of propositions. It is thus difficult to independently determine the quality of the proposition extraction step. We create a manually annotated proposition dataset from sentences taken from restaurant reviews that distinguishes between clauses that need to be split and those that do not. The resulting proposition evaluation dataset allows us to independently compare the performance of proposition extraction systems on simple and complex clauses. Although performance drastically drops on more complex sentences, we show that the same systems perform best on both simple and complex clauses. Furthermore, we show that specific kinds of subordinate clauses pose difficulties to most systems.

pdf bib
RELATIONS - Workshop on meaning relations between phrases and sentences
Venelin Kovatchev | Darina Gold | Torsten Zesch
RELATIONS - Workshop on meaning relations between phrases and sentences

pdf abs
Annotating and analyzing the interactions between meaning relations
Darina Gold | Venelin Kovatchev | Torsten Zesch
Proceedings of the 13th Linguistic Annotation Workshop

Pairs of sentences, phrases, or other text pieces can hold semantic relations such as paraphrasing, textual entailment, contradiction, specificity, and semantic similarity. These relations are usually studied in isolation and no dataset exists where they can be compared empirically. Here we present a corpus annotated with these relations and the analysis of these results. The corpus contains 520 sentence pairs, annotated with these relations. We measure the annotation reliability of each individual relation and we examine their interactions and correlations. Among the unexpected results revealed by our analysis is that the traditionally considered direct relationship between paraphrasing and bi-directional entailment does not hold in our data.

pdf bib
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Helen Yannakoudakis | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Torsten Zesch
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf abs
ltl.uni-due at SemEval-2019 Task 5: Simple but Effective Lexico-Semantic Features for Detecting Hate Speech in Twitter
Huangpan Zhang | Michael Wojatzki | Tobias Horsmann | Torsten Zesch
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our contribution to SemEval 2019 Task 5 Multilingual Detection of Hate, specifically in the Subtask A (English and Spanish). We compare different configurations of shallow and deep learning approaches on the English data and use the system that performs best in both sub-tasks. The resulting SVM-based system with lexicosemantic features (n-grams and embeddings) is ranked 23rd out of 69 on the English data and beats the baseline system. On the Spanish data our system is ranked 25th out of 39.

pdf abs
LTL-UDE at SemEval-2019 Task 6: BERT and Two-Vote Classification for Categorizing Offensiveness
Piush Aggarwal | Tobias Horsmann | Michael Wojatzki | Torsten Zesch
Proceedings of the 13th International Workshop on Semantic Evaluation

We present results for Subtask A and C of SemEval 2019 Shared Task 6. In Subtask A, we experiment with an embedding representation of postings and use BERT to categorize postings. Our best result reaches the 10th place (out of 103). In Subtask C, we applied a two-vote classification approach with minority fallback, which is placed on the 19th rank (out of 65).

2018

pdf
Quantifying Qualitative Data for Understanding Controversial Issues
Michael Wojatzki | Saif Mohammad | Torsten Zesch | Svetlana Kiritchenko
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
ESCRITO - An NLP-Enhanced Educational Scoring Toolkit
Torsten Zesch | Andrea Horbach
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
DeepTC – An Extension of DKPro Text Classification for Fostering Reproducibility of Deep Learning Experiments
Tobias Horsmann | Torsten Zesch
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
Agree or Disagree: Predicting Judgments on Nuanced Assertions
Michael Wojatzki | Torsten Zesch | Saif Mohammad | Svetlana Kiritchenko
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Being able to predict whether people agree or disagree with an assertion (i.e. an explicit, self-contained statement) has several applications ranging from predicting how many people will like or dislike a social media post to classifying posts based on whether they are in accordance with a particular point of view. We formalize this as two NLP tasks: predicting judgments of (i) individuals and (ii) groups based on the text of the assertion and previous judgments. We evaluate a wide range of approaches on a crowdsourced data set containing over 100,000 judgments on over 2,000 assertions. We find that predicting individual judgments is a hard task with our best results only slightly exceeding a majority baseline, but that judgments of groups can be more reliably predicted using a Siamese neural network, which outperforms all other approaches by a wide margin.

pdf abs
Cross-Lingual Content Scoring
Andrea Horbach | Sebastian Stennmanns | Torsten Zesch
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We investigate the feasibility of cross-lingual content scoring, a scenario where training and test data in an automatic scoring task are from two different languages. Cross-lingual scoring can contribute to educational equality by allowing answers in multiple languages. Training a model in one language and applying it to another language might also help to overcome data sparsity issues by re-using trained models from other languages. As there is no suitable dataset available for this new task, we create a comparable bi-lingual corpus by extending the English ASAP dataset with German answers. Our experiments with cross-lingual scoring based on machine-translating either training or test data show a considerable drop in scoring quality.

pdf
The Role of Diacritics in Increasing the Difficulty of Arabic Lexical Recognition Tests
Osama Hamed | Torsten Zesch
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning

2017

pdf abs
Investigating neural architectures for short answer scoring
Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – ngrams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.

pdf abs
Fine-grained essay scoring of a complex writing task for native speakers
Andrea Horbach | Dirk Scholten-Akoun | Yuning Ding | Torsten Zesch
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Automatic essay scoring is nowadays successfully used even in high-stakes tests, but this is mainly limited to holistic scoring of learner essays. We present a new dataset of essays written by highly proficient German native speakers that is scored using a fine-grained rubric with the goal to provide detailed feedback. Our experiments with two state-of-the-art scoring systems (a neural and a SVM-based one) show a large drop in performance compared to existing datasets. This demonstrates the need for such datasets that allow to guide research on more elaborate essay scoring methods.

pdf abs
The Influence of Spelling Errors on Content Scoring Performance
Andrea Horbach | Yuning Ding | Torsten Zesch
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the ASAP corpus. We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these finding in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency in ASAP.

pdf abs
Do LSTMs really work so well for PoS tagging? – A replication study
Tobias Horsmann | Torsten Zesch
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state-of-the-art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora of 21 languages that are annotated with fine-grained tagsets of varying size. Our replication confirms the result in general, and we additionally find that the advantage of LSTMs is even bigger for larger tagsets. However, we also find that for the very large tagsets of morphologically rich languages, hand-crafted morphological lexicons are still necessary to reach state-of-the-art performance.

pdf abs
Same same, but different: Compositionality of paraphrase granularity levels
Darina Benikova | Torsten Zesch
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Paraphrases exist on different granularity levels, the most frequently used one being the sentential level. However, we argue that working on the sentential level is not optimal for both machines and humans, and that it would be easier and more efficient to work on sub-sentential levels. To prove this, we quantify and analyze the difference between paraphrases on both sentence and sub-sentence level in order to show the significance of the problem. First results on a preliminary dataset seem to confirm our hypotheses.

2016

pdf abs
FlexTag: A Highly Flexible PoS Tagging Framework
Torsten Zesch | Tobias Horsmann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present FlexTag, a highly flexible PoS tagging framework. In contrast to monolithic implementations that can only be retrained but not adapted otherwise, FlexTag enables users to modify the feature space and the classification algorithm. Thus, FlexTag makes it easy to quickly develop custom-made taggers exactly fitting the research problem.

pdf
Predicting the Spelling Difficulty of Words for Language Learners
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises
Michael Wojatzki | Oren Melamud | Torsten Zesch
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf
LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
Tobias Horsmann | Torsten Zesch
Proceedings of the 10th Web as Corpus Workshop

pdf bib
Bridging the gap between computable and expressive event representations in Social Media
Darina Benikova | Torsten Zesch
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

pdf
Validating bundled gap filling – Empirical evidence for ambiguity reduction and language proficiency testing capabilities
Niklas Meyer | Michael Wojatzki | Torsten Zesch
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

pdf bib
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Steven Bethard | Marine Carpuat | Daniel Cer | David Jurgens | Preslav Nakov | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
ltl.uni-due at SemEval-2016 Task 6: Stance Detection in Social Media Using Stacked Classifiers
Michael Wojatzki | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf abs
Assigning Fine-grained PoS Tags based on High-precision Coarse-grained Tagging
Tobias Horsmann | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We propose a new approach to PoS tagging where in a first step, we assign a coarse-grained tag corresponding to the main syntactic category. Based on this high-precision decision, in the second step we utilize specially trained fine-grained models with heavily reduced decision complexity. By analyzing the system under oracle conditions, we show that there is a quite large potential for significantly outperforming a competitive baseline. When we take error-propagation from the coarse-grained tagging into account, our approach is on par with the state of the art. Our approach also allows tailoring the tagger towards recognizing single word classes which are of interest e.g. for researchers searching for specific phenomena in large corpora. In a case study, we significantly outperform a standard model that also makes use of the same optimizations.

pdf abs
Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks
Ildikó Pilán | Elena Volodina | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The lack of a sufficient amount of data tailored for a task is a well-recognized problem for many statistical NLP methods. In this paper, we explore whether data sparsity can be successfully tackled when classifying language proficiency levels in the domain of learner-written output texts. We aim at overcoming data sparsity by incorporating knowledge in the trained model from another domain consisting of input texts written by teaching professionals for learners. We compare different domain adaptation techniques and find that a weighted combination of the two types of data performs best, which can even rival systems based on considerably larger amounts of in-domain data. Moreover, we show that normalizing errors in learners’ texts can substantially improve classification when level-annotated in-domain data is not available.

2015

pdf bib
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
Preslav Nakov | Torsten Zesch | Daniel Cer | David Jurgens
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Candidate evaluation strategies for improved difficulty prediction of language tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Reducing Annotation Efforts in Supervised Short Answer Scoring
Torsten Zesch | Michael Heilman | Aoife Cahill
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Task-Independent Features for Automated Essay Grading
Torsten Zesch | Michael Wojatzki | Dirk Scholten-Akoun
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Counting What Counts: Decompounding for Keyphrase Extraction
Nicolai Erbs | Pedro Bispo Santos | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

2014

pdf
Sense and Similarity: A Study of Sense-level Similarity Measures
Nicolai Erbs | Iryna Gurevych | Torsten Zesch
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
Preslav Nakov | Torsten Zesch
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
Nicolai Erbs | Pedro Bispo Santos | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
Johannes Daxenberger | Oliver Ferschke | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf
Automatic Generation of Challenging Distractors Using Context-Sensitive Inference Rules
Torsten Zesch | Oren Melamud
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Towards Automatic Scoring of Cloze Items by Selecting Low-Ambiguity Contexts
Tobias Horsmann | Torsten Zesch
Proceedings of the third workshop on NLP for computer-assisted language learning

pdf abs
Predicting the Difficulty of Language Proficiency Tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 2

Language proficiency tests are used to evaluate and compare the progress of language learners. We present an approach for automatic difficulty prediction of C-tests that performs on par with human experts. On the basis of detailed analysis of newly collected data, we develop a model for C-test difficulty introducing four dimensions: solution difficulty, candidate ambiguity, inter-gap dependency, and paragraph difficulty. We show that cues from all four dimensions contribute to C-test difficulty.

Wikipedia has been used as a knowledge source in many areas of natural language processing. As most studies only use a certain Wikipedia snapshot, the influence of Wikipedias massive growth on the results is largely unknown. For the first time, we perform an in-depth analysis of this influence using semantic relatedness as an example application that tests a wide range of Wikipedias properties. We find that the growth of Wikipedia has almost no effect on the correlation of semantic relatedness measures with human judgments, while the coverage steadily increases.

2009

pdf
Approximate Matching for Evaluating Keyphrase Extraction
Torsten Zesch | Iryna Gurevych
Proceedings of the International Conference RANLP-2009

pdf bib
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)
Iryna Gurevych | Torsten Zesch
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

2008

pdf abs
Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch | Christof Müller | Iryna Gurevych
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes.