Robert Remus

2015

pdf bib
ExB Themis: Extensive Feature Extraction from Word Alignments for Semantic Textual Similarity
Christian Hänig | Robert Remus | Xose De La Puente
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
ExB Text Summarizer
Stefan Thomas | Christian Beutenmüller | Xose de la Puente | Robert Remus | Stefan Bordag
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

pdf bib abs
Learning from Domain Complexity
Robert Remus | Dominique Ziegelmayer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sentiment analysis is genre and domain dependent, i.e. the same method performs differently when applied to text that originates from different genres and domains. Intuitively, this is due to different language use in different genres and domains. We measure such differences in a sentiment analysis gold standard dataset that contains texts from 1 genre and 10 domains. Differences in language use are quantified using certain language statistics, viz. domain complexity measures. We investigate 4 domain complexity measures: percentage of rare words, word richness, relative entropy and corpus homogeneity. We relate domain complexity measurements to performance of a standard machine learning-based classifier and find strong correlations. We show that we can accurately estimate its performance based on domain complexity using linear regression models fitted using robust loss functions. Moreover, we illustrate how domain complexity may guide us in model selection, viz. in deciding what word n-gram order to employ in a discriminative model and whether to employ aggressive or conservative word n-gram feature selection.

2013

pdf bib
ASVUniOfLeipzig: Sentiment Analysis in Twitter using Data-driven Machine Learning Techniques
Robert Remus
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

In this paper, we describe MLSA, a publicly available multi-layered reference corpus for German-language sentiment analysis. The construction of the corpus is based on the manual annotation of 270 German-language sentences considering three different layers of granularity. The sentence-layer annotation, as the most coarse-grained annotation, focuses on aspects of objectivity, subjectivity and the overall polarity of the respective sentences. Layer 2 is concerned with polarity on the word- and phrase-level, annotating both subjective and factual language. The annotations on Layer 3 focus on the expression-level, denoting frames of private states such as objective and direct speech events. These three layers and their respective annotations are intended to be fully independent of each other. At the same time, exploring for and discovering interactions that may exist between different layers should also be possible. The reliability of the respective annotations was assessed using the average pairwise agreement and Fleiss' multi-rater measures. We believe that MLSA is a beneficial resource for sentiment analysis research, algorithms and applications that focus on the German language.

pdf bib abs
Learning Categories and their Instances by Contextual Features
Antje Schlaf | Robert Remus
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a 3-step framework that learns categories and their instances from natural language text based on given training examples. Step 1 extracts contexts of training examples as rules describing this category from text, considering part of speech, capitalization and category membership as features. Step 2 selects high quality rules using two consequent filters. The first filter is based on the number of rule occurrences, the second filter takes two non-independent characteristics into account: a rule's precision and the amount of instances it acquires. Our framework adapts the filter's threshold values to the respective category and the textual genre by automatically evaluating rule sets resulting from different filter settings and selecting the best performing rule set accordingly. Step 3 then identifies new instances of a category using the filtered rules applied within a previously proposed algorithm. We inspect the rule filters' impact on rule set quality and evaluate our framework by learning first names, last names, professions and cities from a hitherto unexplored textual genre -- search engine result snippets -- and achieve high precision on average.

pdf bib abs
Textual Characteristics for Language Engineering
Mathias Bank | Robert Remus | Martin Schierle
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language statistics are widely used to characterize and better understand language. In parallel, the amount of text mining and information retrieval methods grew rapidly within the last decades, with many algorithms evaluated on standardized corpora, often drawn from newspapers. However, up to now there were almost no attempts to link the areas of natural language processing and language statistics in order to properly characterize those evaluation corpora, and to help others to pick the most appropriate algorithms for their particular corpus. We believe no results in the field of natural language processing should be published without quantitatively describing the used corpora. Only then the real value of proposed methods can be determined and the transferability to corpora originating from different genres or domains can be estimated. We lay ground for a language engineering process by gathering and defining a set of textual characteristics we consider valuable with respect to building natural language processing systems. We carry out a case study for the analysis of automotive repair orders and explicitly call upon the scientific community to provide feedback and help to establish a good practice of corpus-aware evaluations.

2011

pdf bib
Improving Sentence-level Subjectivity Classification through Readability Measurement
Robert Remus
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

pdf bib abs
SentiWS - A Publicly Available German-language Resource for Sentiment Analysis
Robert Remus | Uwe Quasthoff | Gerhard Heyer
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative sentiment bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 negative and 1,818 positive words, which sum up to 16,406 positive and 16,328 negative word forms, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one. The present work describes the resources structure, the three sources utilised to assemble it and the semi-supervised method incorporated to weight the strength of its entries. Furthermore the resources contents are extensively evaluated using a German-language evaluation set we constructed. The evaluation set is verified being reliable and its shown that SentiWS provides a beneficial lexical resource for German-language sentiment analysis related tasks to build on.