Philipp Heinrich


2020

Corpus Query Lingua Franca part II: Ontology
Stefan Evert | Oleg Harlamov | Philipp Heinrich | Piotr Banski
Proceedings of the Twelfth Language Resources and Evaluation Conference

The present paper outlines the projected second part of the Corpus Query Lingua Franca (CQLF) family of standards: CQLF Ontology, which is currently in the process of standardization at the International Organization for Standardization (ISO), in its Technical Committee 37, Subcommittee 4 (TC37SC4) and its national mirrors. The first part of the family, ISO 24623-1 (henceforth CQLF Metamodel), was successfully adopted as an international standard at the beginning of 2018. The present paper reflects the state of the CQLF Ontology at the moment of submission for the Committee Draft ballot. We provide a brief overview of the CQLF Metamodel, present the assumptions and aims of the CQLF Ontology, its basic structure, and its potential extended applications. The full ontology is expected to emerge from a community process, starting from an initial version created by the authors of the present paper.

EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus
Thomas Proisl | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Andreas Blombach | Stefan Evert
Proceedings of the Twelfth Language Resources and Evaluation Conference

The EmpiriST corpus (Beißwenger et al., 2016) is a manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC (computer-mediated communication) data. We extend the corpus with manually created annotation layers for word form normalization, lemmatization and lexical semantics. All annotations have been independently performed by multiple human annotators. We report inter-annotator agreements and results of baseline systems and state-of-the-art off-the-shelf tools.

A Corpus of German Reddit Exchanges (GeRedE)
Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Thomas Proisl
Proceedings of the Twelfth Language Resources and Evaluation Conference

GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to extracting the German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.

2019

The_Illiterati: Part-of-Speech Tagging for Magahi and Bhojpuri without even knowing the alphabet
Thomas Proisl | Peter Uhrig | Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Sefora Mammarella
Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers

2018

EmotiKLUE at IEST 2018: Topic-Informed Classification of Implicit Emotions
Thomas Proisl | Philipp Heinrich | Besim Kabashi | Stefan Evert
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

EmotiKLUE is a submission to the Implicit Emotion Shared Task. It is a deep learning system that combines independent representations of the left and right contexts of the emotion word with the topic distribution of an LDA topic model. EmotiKLUE achieves a macro-average F₁ score of 67.13%, significantly outperforming the baseline produced by a simple ML classifier. Further enhancements after the evaluation period lead to an improved F₁ score of 68.10%.