Jonas Kuhn

2024

pdf abs
The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change
Dominik Schlechtweg | Shafqat Mumtaz Virk | Pauline Sander | Emma Sköldberg | Lukas Theuer Linke | Tuo Zhang | Nina Tahmasebi | Jonas Kuhn | Sabine Schulte Im Walde
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We present the DURel tool implementing the annotation of semantic proximity between word uses into an online, open source interface. The tool supports standardized human annotation as well as computational annotation, building on recent advances with Word-in-Context models. Annotator judgments are clustered with automatic graph clustering techniques and visualized for analysis. This allows to measure word senses with simple and intuitive micro-task judgments between use pairs, requiring minimal preparation efforts. The tool offers additional functionalities to compare the agreement between annotators to guarantee the inter-subjectivity of the obtained judgments and to calculate summary statistics over the annotated data giving insights into sense frequency distributions, semantic variation or changes of senses over time.

2023

pdf abs
Reinforced Active Learning for Low-Resource, Domain-Specific, Multi-Label Text Classification
Lukas Wertz | Jasmina Bogojeska | Katsiaryna Mirylenka | Jonas Kuhn
Findings of the Association for Computational Linguistics: ACL 2023

Text classification datasets from specialised or technical domains are in high demand, especially in industrial applications. However, due to the high cost of annotation such datasets are usually expensive to create. While Active Learning (AL) can reduce the labeling cost, required AL strategies are often only tested on general knowledge domains and tend to use information sources that are not consistent across tasks. We propose Reinforced Active Learning (RAL) to train a Reinforcement Learning policy that utilizes many different aspects of the data and the task in order to select the most informative unlabeled subset dynamically over the course of the AL procedure. We demonstrate the superior performance of the proposed RAL framework compared to strong AL baselines across four intricate multi-class, multi-label text classification datasets taken from specialised domains. In addition, we experiment with a unique data augmentation approach to further reduce the number of samples RAL needs to annotate.

2022

pdf abs
Evaluating Pre-Trained Sentence-BERT with Class Embeddings in Active Learning for Multi-Label Text Classification
Lukas Wertz | Jasmina Bogojeska | Katsiaryna Mirylenka | Jonas Kuhn
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The Transformer Language Model is a powerful tool that has been shown to excel at various NLP tasks and has become the de-facto standard solution thanks to its versatility. In this study, we employ pre-trained document embeddings in an Active Learning task to group samples with the same labels in the embedding space on a legal document corpus. We find that the calculated class embeddings are not close to the respective samples and consequently do not partition the embedding space in a meaningful way. In addition, we explore using the class embeddings as an Active Learning strategy with dramatically reduced results compared to all baselines.

pdf abs
Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification
Lukas Wertz | Katsiaryna Mirylenka | Jonas Kuhn | Jasmina Bogojeska
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Large scale, multi-label text datasets with high numbers of different classes are expensive to annotate, even more so if they deal with domain specific language. In this work, we aim to build classifiers on these datasets using Active Learning in order to reduce the labeling effort. We outline the challenges when dealing with extreme multi-label settings and show the limitations of existing Active Learning strategies by focusing on their effectiveness as well as efficiency in terms of computational cost. In addition, we present five multi-label datasets which were compiled from hierarchical classification tasks to serve as benchmarks in the context of extreme multi-label classification for future experiments. Finally, we provide insight into multi-class, multi-label evaluation and present an improved classifier architecture on top of pre-trained transformer language models.

We present the steps taken towards an exploration platform for a multi-modal corpus of German lyric poetry from the Romantic era developed in the project »textklang«. This interdisciplinary project develops a mixed-methods approach for the systematic investigation of the relationship between written text (here lyric poetry) and its potential and actual sonic realisation (in recitations, musical performances etc.). The multi-modal »textklang« platform will be designed to technically and analytically combine three modalities: the poetic text, the audio signal of a recorded recitation and, at a later stage, music scores of a musical setting of a poem. The methodological workflow will enable scholars to develop hypotheses about the relationship between textual form and sonic/prosodic realisation based on theoretical considerations, text interpretation and evidence from recorded recitations. The full workflow will support hypothesis testing either through systematic corpus analysis alone or with addtional contrastive perception experiments. For the experimental track, researchers will be enabled to manipulate prosodic parameters in (re-)synthesised variants of the original recordings. The focus of this paper is on the design of the base corpus and on tools for systematic exploration – placing special emphasis on our response to challenges stemming from multi-modality and the methodologically diverse interdisciplinary setup.

Many tasks in text-based computational social science (CSS) involve the classification of political statements into categories based on a domain-specific codebook. In order to be useful for CSS analysis, these categories must be fine-grained. The typically skewed distribution of fine-grained categories, however, results in a challenging classification problem on the NLP side. This paper proposes to make use of the hierarchical relations among categories typically present in such codebooks:e.g., markets and taxation are both subcategories of economy, while borders is a subcategory of security. We use these ontological relations as prior knowledge to establish additional constraints on the learned model, thusimproving performance overall and in particular for infrequent categories. We evaluate several lightweight variants of this intuition by extending state-of-the-art transformer-based textclassifiers on two datasets and multiple languages. We find the most consistent improvement for an approach based on regularization.

2021

pdf abs
Negation-Instance Based Evaluation of End-to-End Negation Resolution
Elizaveta Sineva | Stefan Grünewald | Annemarie Friedrich | Jonas Kuhn
Proceedings of the 25th Conference on Computational Natural Language Learning

In this paper, we revisit the task of negation resolution, which includes the subtasks of cue detection (e.g. “not”, “never”) and scope resolution. In the context of previous shared tasks, a variety of evaluation metrics have been proposed. Subsequent works usually use different subsets of these, including variations and custom implementations, rendering meaningful comparisons between systems difficult. Examining the problem both from a linguistic perspective and from a downstream viewpoint, we here argue for a negation-instance based approach to evaluating negation resolution. Our proposed metrics correspond to expectations over per-instance scores and hence are intuitively interpretable. To render research comparable and to foster future work, we provide results for a set of current state-of-the-art systems for negation resolution on three English corpora, and make our implementation of the evaluation scripts publicly available.

pdf abs
Explaining and Improving BERT Performance on Lexical Semantic Change Detection
Severin Laicher | Sinan Kurtyigit | Dominik Schlechtweg | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Type- and token-based embedding architectures are still competing in lexical semantic change detection. The recent success of type-based models in SemEval-2020 Task 1 has raised the question why the success of token-based models on a variety of other NLP tasks does not translate to our field. We investigate the influence of a range of variables on clusterings of BERT vectors and show that its low performance is largely due to orthographic information on the target word, which is encoded even in the higher layers of BERT representations. By reducing the influence of orthography we considerably improve BERT’s performance.

The analysis of public debates crucially requires the classification of political demands according to hierarchical claim ontologies (e.g. for immigration, a supercategory “Controlling Migration” might have subcategories “Asylum limit” or “Border installations”). A major challenge for automatic claim classification is the large number and low frequency of such subclasses. We address it by jointly predicting pairs of matching super- and subcategories. We operationalize this idea by (a) encoding soft constraints in the claim classifier and (b) imposing hard constraints via Integer Linear Programming. Our experiments with different claim classifiers on a German immigration newspaper corpus show consistent performance increases for joint prediction, in particular for infrequent categories and discuss the complementarity of the two approaches.

pdf
WordGuess: Using Associations for Guessing, Learning and Exploring Related Words
Cennet Oguz | André Blessing | Jonas Kuhn | Sabine Schulte Im Walde
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf abs
Modeling Sense Structure in Word Usage Graphs with the Weighted Stochastic Block Model
Dominik Schlechtweg | Enrique Castaneda | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

We suggest to model human-annotated Word Usage Graphs capturing fine-grained semantic proximity distinctions between word uses with a Bayesian formulation of the Weighted Stochastic Block Model, a generative model for random graphs popular in biology, physics and social sciences. By providing a probabilistic model of graded word meaning we aim to approach the slippery and yet widely used notion of word sense in a novel way. The proposed framework enables us to rigorously compare models of word senses with respect to their fit to the data. We perform extensive experiments and select the empirically most adequate model.

pdf abs
Compound or Term Features? Analyzing Salience in Predicting the Difficulty of German Noun Compounds across Domains
Anna Hätty | Julia Bettinger | Michael Dorna | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Predicting the difficulty of domain-specific vocabulary is an important task towards a better understanding of a domain, and to enhance the communication between lay people and experts. We investigate German closed noun compounds and focus on the interaction of compound-based lexical features (such as frequency and productivity) and terminology-based features (contrasting domain-specific and general language) across word representations and classifiers. Our prediction experiments complement insights from classification using (a) manually designed features to characterise termhood and compound formation and (b) compound and constituent word embeddings. We find that for a broad binary distinction into ‘easy’ vs. ‘difficult’ general-language compound frequency is sufficient, but for a more fine-grained four-class distinction it is crucial to include contrastive termhood features and compound and constituent features.

pdf abs
Applying Occam’s Razor to Transformer-Based Dependency Parsing: What Works, What Doesn’t, and What is Really Necessary
Stefan Grünewald | Annemarie Friedrich | Jonas Kuhn
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

The introduction of pre-trained transformer-based contextualized word embeddings has led to considerable improvements in the accuracy of graph-based parsers for frameworks such as Universal Dependencies (UD). However, previous works differ in various dimensions, including their choice of pre-trained language models and whether they use LSTM layers. With the aims of disentangling the effects of these choices and identifying a simple yet widely applicable architecture, we introduce STEPS, a new modular graph-based dependency parser. Using STEPS, we perform a series of analyses on the UD corpora of a diverse set of languages. We find that the choice of pre-trained embeddings has by far the greatest impact on parser performance and identify XLM-R as a robust choice across the languages in our study. Adding LSTM layers provides no benefits when using transformer-based embeddings. A multi-task training setup outputting additional UD features may contort results. Taking these insights together, we propose a simple but widely applicable parser architecture and configuration, achieving new state-of-the-art results (in terms of LAS) for 10 out of 12 diverse languages.

pdf abs
Lexical Semantic Change Discovery
Sinan Kurtyigit | Maike Park | Dominik Schlechtweg | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

While there is a large amount of research in the field of Lexical Semantic Change Detection, only few approaches go beyond a standard benchmark evaluation of existing models. In this paper, we propose a shift of focus from change detection to change discovery, i.e., discovering novel word senses over time from the full corpus vocabulary. By heavily fine-tuning a type-based and a token-based approach on recently published German data, we demonstrate that both models can successfully be applied to discover new words undergoing meaning change. Furthermore, we provide an almost fully automated framework for both evaluation and discovery.

2020

pdf abs
Ensemble Self-Training for Low-Resource Languages: Grapheme-to-Phoneme Conversion and Morphological Inflection
Xiang Yu | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

We present an iterative data augmentation framework, which trains and searches for an optimal ensemble and simultaneously annotates new training data in a self-training style. We apply this framework on two SIGMORPHON 2020 shared tasks: grapheme-to-phoneme conversion and morphological inflection. With very simple base models in the ensemble, we rank the first and the fourth in these two tasks. We show in the analysis that our system works especially well on low-resource languages.

DEbateNet-migr15 is a manually annotated dataset for German which covers the public debate on immigration in 2015. The building block of our annotation is the political science notion of a claim, i.e., a statement made by a political actor (a politician, a party, or a group of citizens) that a specific action should be taken (e.g., vacant flats should be assigned to refugees). We identify claims in newspaper articles, assign them to actors and fine-grained categories and annotate their polarity and date. The aim of this paper is two-fold: first, we release the full DEbateNet-mig15 corpus and document it by means of a quantitative and qualitative analysis; second, we demonstrate its application in a discourse network analysis framework, which enables us to capture the temporal dynamics of the political debate

We present GRAIN-S, a set of manually created syntactic annotations for radio interviews in German. The dataset extends an existing corpus GRAIN and comes with constituency and dependency trees for six interviews. The rare combination of gold- and silver-standard annotation layers coming from GRAIN with high-quality syntax trees can serve as a useful resource for speech- and text-based research. Moreover, since interviews can be put between carefully prepared speech and spontaneous conversational speech, they cover phenomena not seen in traditional newspaper-based treebanks. Therefore, GRAIN-S can contribute to research into techniques for model adaptation and for building more corpus-independent tools. GRAIN-S follows TIGER, one of the established syntactic treebanks of German. We describe the annotation process and discuss decisions necessary to adapt the original TIGER guidelines to the interviews domain. Next, we give details on the conversion from TIGER-style trees to dependency trees. We provide data statistics and demonstrate differences between the new dataset and existing out-of-domain test sets annotated with TIGER syntactic structures. Finally, we provide baseline parsing results for further comparison.

pdf abs
CCOHA: Clean Corpus of Historical American English
Reem Alatrash | Dominik Schlechtweg | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of the Twelfth Language Resources and Evaluation Conference

Modelling language change is an increasingly important area of interest within the fields of sociolinguistics and historical linguistics. In recent years, there has been a growing number of publications whose main concern is studying changes that have occurred within the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies in English. This paper describes methods applied to the downloadable version of the COHA corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus CCOHA contains a larger number of cleaned word tokens which can offer better insights into language change and allow for a larger variety of tasks to be performed.

pdf abs
IMSurReal Too: IMS in the Surface Realization Shared Task 2020
Xiang Yu | Simon Tannert | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the Third Workshop on Multilingual Surface Realisation

We introduce the IMS contribution to the Surface Realization Shared Task 2020. The new system achieves substantial improvement over the state-of-the-art system from last year, mainly due to a better token representation and a better linearizer, as well as a simple ensembling approach. We also experiment with data augmentation, which brings some additional performance gain. The system is available at https://github.com/EggplantElf/IMSurReal.

pdf abs
Integrating Graph-Based and Transition-Based Dependency Parsers in the Deep Contextualized Era
Agnieszka Falenska | Anders Björkelund | Jonas Kuhn
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

Graph-based and transition-based dependency parsers used to have different strengths and weaknesses. Therefore, combining the outputs of parsers from both paradigms used to be the standard approach to improve or analyze their performance. However, with the recent adoption of deep contextualized word representations, the chief weakness of graph-based models, i.e., their limited scope of features, has been mitigated. Through two popular combination techniques – blending and stacking – we demonstrate that the remaining diversity in the parsing models is reduced below the level of models trained with different random seeds. Thus, an integration no longer leads to increased accuracy. When both parsers depend on BiLSTMs, the graph-based architecture has a consistent advantage. This advantage stems from globally-trained BiLSTM representations, which capture more distant look-ahead syntactic relations. Such representations can be exploited through multi-task learning, which improves the transition-based parser, especially on treebanks with a high ratio of right-headed dependencies.

pdf abs
Fast and Accurate Non-Projective Dependency Tree Linearization
Xiang Yu | Simon Tannert | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We propose a graph-based method to tackle the dependency tree linearization task. We formulate the task as a Traveling Salesman Problem (TSP), and use a biaffine attention model to calculate the edge costs. We facilitate the decoding by solving the TSP for each subtree and combining the solution into a projective tree. We then design a transition system as post-processing, inspired by non-projective transition-based parsing, to obtain non-projective sentences. Our proposed method outperforms the state-of-the-art linearizer while being 10 times faster in training and decoding.

pdf abs
Real-Valued Logics for Typological Universals: Framework and Application
Tillmann Dönicke | Xiang Yu | Jonas Kuhn
Proceedings of the 28th International Conference on Computational Linguistics

This paper proposes a framework for the expression of typological statements which uses real-valued logics to capture the empirical truth value (truth degree) of a formula on a given data source, e.g. a collection of multilingual treebanks with comparable annotation. The formulae can be arbitrarily complex expressions of propositional logic. To illustrate the usefulness of such a framework, we present experiments on the Universal Dependencies treebanks for two use cases: (i) empirical (re-)evaluation of established formulae against the spectrum of available treebanks and (ii) evaluating new formulae (i.e. potential candidates for universals) generated by a search algorithm.

pdf abs
Identifying and Handling Cross-Treebank Inconsistencies in UD: A Pilot Study
Tillmann Dönicke | Xiang Yu | Jonas Kuhn
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

The Universal Dependencies treebanks are a still-growing collection of treebanks for a wide range of languages, all annotated with a common inventory of dependency relations. Yet, the usages of the relations can be categorically different even for treebanks of the same language. We present a pilot study on identifying such inconsistencies in a language-independent way and conduct an experiment which illustrates that a proper handling of inconsistencies can improve parsing performance by several percentage points.

2019

pdf abs
Learning the Dyck Language with Attention-based Seq2Seq Models
Xiang Yu | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

The generalized Dyck language has been used to analyze the ability of Recurrent Neural Networks (RNNs) to learn context-free grammars (CFGs). Recent studies draw conflicting conclusions on their performance, especially regarding the generalizability of the models with respect to the depth of recursion. In this paper, we revisit several common models and experimental settings, discuss the potential problems of the tasks and analyses. Furthermore, we explore the use of attention mechanisms within the seq2seq framework to learn the Dyck language, which could compensate for the limited encoding ability of RNNs. Our findings reveal that attention mechanisms still cannot truly generalize over the recursion depth, although they perform much better than other models on the closing bracket tagging task. Moreover, this also suggests that this commonly used task is not sufficient to test a model’s understanding of CFGs.

pdf
Dependency Length Minimization vs. Word Order Constraints: An Empirical Study On 55 Treebanks
Xiang Yu | Agnieszka Falenska | Jonas Kuhn
Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019)

pdf abs
Head-First Linearization with Tree-Structured Representation
Xiang Yu | Agnieszka Falenska | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the 12th International Conference on Natural Language Generation

We present a dependency tree linearization model with two novel components: (1) a tree-structured encoder based on bidirectional Tree-LSTM that propagates information first bottom-up then top-down, which allows each token to access information from the entire tree; and (2) a linguistically motivated head-first decoder that emphasizes the central role of the head and linearizes the subtree by incrementally attaching the dependents on both sides of the head. With the new encoder and decoder, we reach state-of-the-art performance on the Surface Realization Shared Task 2018 dataset, outperforming not only the shared tasks participants, but also previous state-of-the-art systems (Bohnet et al., 2011; Puduppully et al., 2016). Furthermore, we analyze the power of the tree-structured encoder with a probing task and show that it is able to recognize the topological relation between any pair of tokens in a tree.

pdf abs
The (Non-)Utility of Structural Features in BiLSTM-based Dependency Parsers
Agnieszka Falenska | Jonas Kuhn
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Classical non-neural dependency parsers put considerable effort on the design of feature functions. Especially, they benefit from information coming from structural features, such as features drawn from neighboring tokens in the dependency tree. In contrast, their BiLSTM-based successors achieve state-of-the-art performance without explicit information about the structural context. In this paper we aim to answer the question: How much structural context are the BiLSTM representations able to capture implicitly? We show that features drawn from partial subtrees become redundant when the BiLSTMs are used. We provide a deep insight into information flow in transition- and graph-based neural architectures to demonstrate where the implicit information comes from when the parsers make their decisions. Finally, with model ablations we demonstrate that the structural context is not only present in the models, but it significantly influences their performance.

pdf abs
Who Sides with Whom? Towards Computational Construction of Discourse Networks for Political Debates
Sebastian Padó | Andre Blessing | Nico Blokker | Erenay Dayanik | Sebastian Haunss | Jonas Kuhn
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Understanding the structures of political debates (which actors make what claims) is essential for understanding democratic political decision making. The vision of computational construction of such discourse networks from newspaper reports brings together political science and natural language processing. This paper presents three contributions towards this goal: (a) a requirements analysis, linking the task to knowledge base population; (b) an annotated pilot corpus of migration claims based on German newspaper reports; (c) initial modeling results.

pdf abs
An Environment for Relational Annotation of Political Debates
Andre Blessing | Nico Blokker | Sebastian Haunss | Jonas Kuhn | Gabriella Lapesa | Sebastian Padó
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This paper describes the MARDY corpus annotation environment developed for a collaboration between political science and computational linguistics. The tool realizes the complete workflow necessary for annotating a large newspaper text collection with rich information about claims (demands) raised by politicians and other actors, including claim and actor spans, relations, and polarities. In addition to the annotation GUI, the tool supports the identification of relevant documents, text pre-processing, user management, integration of external knowledge bases, annotation comparison and merging, statistical analysis, and the incorporation of machine learning models as “pseudo-annotators”.

pdf abs
IMSurReal: IMS at the Surface Realization Shared Task 2019
Xiang Yu | Agnieszka Falenska | Marina Haid | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

We introduce the IMS contribution to the Surface Realization Shared Task 2019. Our submission achieves the state-of-the-art performance without using any external resources. The system takes a pipeline approach consisting of five steps: linearization, completion, inflection, contraction, and detokenization. We compare the performance of our linearization algorithm with two external baselines and report results for each step in the pipeline. Furthermore, we perform detailed error analysis revealing correlation between word order freedom and difficulty of the linearization task.

2018

pdf abs
Polyglot Semantic Parsing in APIs
Kyle Richardson | Jonathan Berant | Jonas Kuhn
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017b,a). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.

pdf abs
Supervised Rhyme Detection with Siamese Recurrent Networks
Thomas Haider | Jonas Kuhn
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We present the first supervised approach to rhyme detection with Siamese Recurrent Networks (SRN) that offer near perfect performance (97% accuracy) with a single model on rhyme pairs for German, English and French, allowing future large scale analyses. SRNs learn a similarity metric on variable length character sequences that can be used as judgement on the distance of imperfect rhyme pairs and for binary classification. For training, we construct a diachronically balanced rhyme goldstandard of New High German (NHG) poetry. For further testing, we sample a second collection of NHG poetry and set of contemporary Hip-Hop lyrics, annotated for rhyme and assonance. We train several high-performing SRN models and evaluate them qualitatively on selected sonnetts.

pdf abs
Approximate Dynamic Oracle for Dependency Parsing with Reinforcement Learning
Xiang Yu | Ngoc Thang Vu | Jonas Kuhn
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We present a general approach with reinforcement learning (RL) to approximate dynamic oracles for transition systems where exact dynamic oracles are difficult to derive. We treat oracle parsing as a reinforcement learning problem, design the reward function inspired by the classical dynamic oracle, and use Deep Q-Learning (DQN) techniques to train the oracle with gold trees as features. The combination of a priori knowledge and data-driven methods enables an efficient dynamic oracle, which improves the parser performance over static oracles in several transition systems.

pdf abs
Bridging resolution: Task definition, corpus resources and rule-based experiments
Ina Roesiger | Arndt Riester | Jonas Kuhn
Proceedings of the 27th International Conference on Computational Linguistics

Recent work on bridging resolution has so far been based on the corpus ISNotes (Markert et al. 2012), as this was the only corpus available with unrestricted bridging annotation. Hou et al. 2014’s rule-based system currently achieves state-of-the-art performance on this corpus, as learning-based approaches suffer from the lack of available training data. Recently, a number of new corpora with bridging annotations have become available. To test the generalisability of the approach by Hou et al. 2014, we apply a slightly extended rule-based system to these corpora. Besides the expected out-of-domain effects, we also observe low performance on some of the in-domain corpora. Our analysis shows that this is the result of two very different phenomena being defined as bridging, namely referential and lexical bridging. We also report that filtering out gold or predicted coreferent anaphors before applying the bridging resolution system helps improve bridging resolution.

Today, we see an ever growing number of tools supporting text annotation. Each of these tools is optimized for specific use-cases such as named entity recognition. However, we see large growing knowledge bases such as Wikipedia or the Google Knowledge Graph. In this paper, we introduce NLATool, a web application developed using a human-centered design process. The application combines supporting text annotation and enriching the text with additional information from a number of sources directly within the application. The tool assists users to efficiently recognize named entities, annotate text, and automatically provide users additional information while solving deep text understanding tasks.

pdf
A Lightweight Modeling Middleware for Corpus Processing
Markus Gärtner | Jonas Kuhn
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Moving TIGER beyond Sentence-Level
Agnieszka Falenska | Kerstin Eckart | Jonas Kuhn
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
IMS at the CoNLL 2017 UD Shared Task: CRFs and Perceptrons Meet Neural Networks
Anders Björkelund | Agnieszka Falenska | Xiang Yu | Jonas Kuhn
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper presents the IMS contribution to the CoNLL 2017 Shared Task. In the preprocessing step we employed a CRF POS/morphological tagger and a neural tagger predicting supertags. On some languages, we also applied word segmentation with the CRF tagger and sentence segmentation with a perceptron-based parser. For parsing we took an ensemble approach by blending multiple instances of three parsers with very different architectures. Our system achieved the third place overall and the second place for the surprise languages.

pdf abs
Multi-modular domain-tailored OCR post-correction
Sarah Schulz | Jonas Kuhn
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One of the main obstacles for many Digital Humanities projects is the low data availability. Texts have to be digitized in an expensive and time consuming process whereas Optical Character Recognition (OCR) post-correction is one of the time-critical factors. At the example of OCR post-correction, we show the adaptation of a generic system to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of literature. We show that the combination of different approaches, such as e.g. Statistical Machine Translation and spell checking, with the help of a ranking mechanism tremendously improves over single-handed approaches. Since we consider the accessibility of the resulting tool as a crucial part of Digital Humanities collaborations, we describe the workflow we suggest for efficient text recognition and subsequent automatic and manual post-correction

pdf abs
Function Assistant: A Tool for NL Querying of APIs
Kyle Richardson | Jonas Kuhn
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

In this paper, we describe Function Assistant, a lightweight Python-based toolkit for querying and exploring source code repositories using natural language. The toolkit is designed to help end-users of a target API quickly find information about functions through high-level natural language queries, or descriptions. For a given text query and background API, the tool finds candidate functions by performing a translation from the text to known representations in the API using the semantic parsing approach of (Richardson and Kuhn, 2017). Translations are automatically learned from example text-code pairs in example APIs. The toolkit includes features for building translation pipelines and query engines for arbitrary source code projects. To explore this last feature, we perform new experiments on 27 well-known Python projects hosted on Github.

pdf abs
The Code2Text Challenge: Text Generation in Source Libraries
Kyle Richardson | Sina Zarrieß | Jonas Kuhn
Proceedings of the 10th International Conference on Natural Language Generation

We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction, and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.

pdf abs
Learning Semantic Correspondences in Technical Documentation
Kyle Richardson | Jonas Kuhn
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representation of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals.

2016

pdf bib abs
Flexible and Reliable Text Analytics in the Digital Humanities – Some Methodological Considerations
Jonas Kuhn
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

The availability of Language Technology Resources and Tools generates a considerable methodological potential in the Digital Humanities: aspects of research questions from the Humanities and Social Sciences can be addressed on text collections in ways that were unavailable to traditional approaches. I start this talk by sketching some sample scenarios of Digital Humanities projects which involve various Humanities and Social Science disciplines, noting that the potential for a meaningful contribution to higher-level questions is highest when the employed language technological models are carefully tailored both (a) to characteristics of the given target corpus, and (b) to relevant analytical subtasks feeding the discipline-specific research questions. Keeping up a multidisciplinary perspective, I then point out a recurrent dilemma in Digital Humanities projects that follow the conventional set-up of collaboration: to build high-quality computational models for the data, fixed analytical targets should be specified as early as possible – but to be able to respond to Humanities questions as they evolve over the course of analysis, the analytical machinery should be kept maximally flexible. To reach both, I argue for a novel collaborative culture that rests on a more interleaved, continuous dialogue. (Re-)Specification of analytical targets should be an ongoing process in which the Humanities Scholars and Social Scientists play a role that is as important as the Computational Scientists’ role. A promising approach lies in the identification of re-occurring types of analytical subtasks, beyond linguistic standard tasks, which can form building blocks for text analysis across disciplines, and for which corpus-based characterizations (viz. annotations) can be collected, compared and revised. On such grounds, computational modeling is more directly tied to the evolving research questions, and hence the seemingly opposing needs of reliable target specifications vs. “malleable” frameworks of analysis can be reconciled. Experimental work following this approach is under way in the Center for Reflected Text Analytics (CRETA) in Stuttgart.

pdf
How to Train Dependency Parsers with Inexact Search for Joint Sentence Boundary Detection and Parsing of Entire Documents
Anders Björkelund | Agnieszka Faleńska | Wolfgang Seeker | Jonas Kuhn
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf abs
Learning to Make Inferences in a Semantic Parsing Task
Kyle Richardson | Jonas Kuhn
Transactions of the Association for Computational Linguistics, Volume 4

We introduce a new approach to training a semantic parser that uses textual entailment judgements as supervision. These judgements are based on high-level inferences about whether the meaning of one sentence follows from another. When applied to an existing semantic parsing task, they prove to be a useful tool for revealing semantic distinctions and background knowledge not captured in the target representations. This information is used to improve the quality of the semantic representations being learned and to acquire generic knowledge for reasoning. Experiments are done on the benchmark Sportscaster corpus (Chen and Mooney, 2008), and a novel RTE-inspired inference dataset is introduced. On this new dataset our method strongly outperforms several strong baselines. Separately, we obtain state-of-the-art results on the original Sportscaster semantic parsing task.

pdf abs
IMS HotCoref DE: A Data-driven Co-reference Resolver for German
Ina Roesiger | Jonas Kuhn
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a data-driven co-reference resolution system for German that has been adapted from IMS HotCoref, a co-reference resolver for English. It describes the difficulties when resolving co-reference in German text, the adaptation process and the features designed to address linguistic challenges brought forth by German. We report performance on the reference dataset TüBa-D/Z and include a post-task SemEval 2010 evaluation, showing that the resolver achieves state-of-the-art performance. We also include ablation experiments that indicate that integrating linguistic features increases results. The paper also describes the steps and the format necessary to use the resolver on new texts. The tool is freely available for download.

pdf abs
Learning from Within? Comparing PoS Tagging Approaches for Historical Text
Sarah Schulz | Jonas Kuhn
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we investigate unsupervised and semi-supervised methods for part-of-speech (PoS) tagging in the context of historical German text. We locate our research in the context of Digital Humanities where the non-canonical nature of text causes issues facing an Natural Language Processing world in which tools are mainly trained on standard data. Data deviating from the norm requires tools adjusted to this data. We explore to which extend the availability of such training material and resources related to it influences the accuracy of PoS tagging. We investigate a variety of algorithms including neural nets, conditional random fields and self-learning techniques in order to find the best-fitted approach to tackle data sparsity. Although methods using resources from related languages outperform weakly supervised methods using just a few training examples, we can still reach a promising accuracy with methods abstaining additional resources.

pdf abs
Named Entity Disambiguation for little known referents: a topic-based approach
Andrea Glaser | Jonas Kuhn
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We propose an approach to Named Entity Disambiguation that avoids a problem of standard work on the task (likewise affecting fully supervised, weakly supervised, or distantly supervised machine learning techniques): the treatment of name mentions referring to people with no (or very little) coverage in the textual training data is systematically incorrect. We propose to indirectly take into account the property information for the “non-prominent” name bearers, such as nationality and profession (e.g., for a Canadian law professor named Michael Jackson, with no Wikipedia article, it is very hard to obtain reliable textual training data). The target property information for the entities is directly available from name authority files, or inferrable, e.g., from listings of sportspeople etc. Our proposed approach employs topic modeling to exploit textual training data based on entities sharing the relevant properties. In experiments with a pilot implementation of the general approach, we show that the approach does indeed work well for name/referent pairs with limited textual coverage in the training data.

2015

pdf
Structural Alignment for Comparison Detection
Wiltrud Kessler | Jonas Kuhn
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf
A Pilot Experiment on Exploiting Translations for Literary Studies on Kafka’s “Verwandlung”
Fabienne Cap | Ina Rösiger | Jonas Kuhn
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

pdf
Towards Opinion Mining from Reviews for the Prediction of Product Rankings
Wiltrud Kessler | Roman Klinger | Jonas Kuhn
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf
Multi-modal Visualization and Search for Text and Prosody Annotations
Markus Gärtner | Katrin Schweitzer | Kerstin Eckart | Jonas Kuhn
Proceedings of ACL-IJCNLP 2015 System Demonstrations

2014

pdf
Learning Structured Perceptrons for Coreference Resolution with Latent Antecedents and Non-local Features
Anders Björkelund | Jonas Kuhn
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Visualization, Search, and Error Analysis for Coreference Annotations
Markus Gärtner | Anders Björkelund | Gregor Thiele | Wolfgang Seeker | Jonas Kuhn
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf abs
A Corpus of Comparisons in Product Reviews
Wiltrud Kessler | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sentiment analysis (or opinion mining) deals with the task of determining the polarity of an opinionated document or sentence. Users often express sentiment about one product by comparing it to a different product. In this work, we present a corpus of comparison sentences from English camera reviews. For our purposes we define a comparison to be any statement about the similarity or difference of two entities. For each sentence we have annotated detailed information about the comparisons it contains: The comparative predicate that expresses the comparison, the type of the comparison, the two entities that are being compared, and the aspect they are compared in. The results of our agreement study show that the decision whether a sentence contains a comparison is difficult to make even for trained human annotators. Once that decision is made, we can achieve consistent results for the very detailed annotations. In total, we have annotated 2108 comparisons in 1707 sentences from camera reviews which makes our corpus the largest resource currently available. The corpus and the annotation guidelines are publicly available on our website.

pdf abs
Textual Emigration Analysis (TEA)
Andre Blessing | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a web-based application which is called TEA (Textual Emigration Analysis) as a showcase that applies textual analysis for the humanities. The TEA tool is used to transform raw text input into a graphical display of emigration source and target countries (under a global or an individual perspective). It provides emigration-related frequency information, and gives access to individual textual sources, which can be downloaded by the user. Our application is built on top of the CLARIN infrastructure which targets researchers of the humanities. In our scenario, we focus on historians, literary scientists, and other social scientists that are interested in the semantic interpretation of text. Our application processes a large set of documents to extract information about people who emigrated. The current implementation integrates two data sets: A data set from the Global Migrant Origin Database, which does not need additional processing, and a data set which was extracted from the German Wikipedia edition. The TEA tool can be accessed by using the following URL: http://clarin01.ims.uni-stuttgart.de/geovis/showcase.html

pdf abs
Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
Masood Ghayoomi | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A treebank is an important language resource for supervised statistical parsers. The parser induces the grammatical properties of a language from this language resource and uses the model to parse unseen data automatically. Since developing such a resource is very time-consuming and tedious, one can take advantage of already extant resources by adapting them to a particular application. This reduces the amount of human effort required to develop a new language resource. In this paper, we introduce an algorithm to convert an HPSG-based treebank into its parallel dependency-based treebank. With this converter, we can automatically create a new language resource from an existing treebank developed based on a grammar formalism. Our proposed algorithm is able to create both projective and non-projective dependency trees.

pdf abs
An Out-of-Domain Test Suite for Dependency Parsing of German
Wolfgang Seeker | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a dependency conversion of five German test sets from five different genres. The dependency representation is made as similar as possible to the dependency representation of TiGer, one of the two big syntactic treebanks of German. The purpose of these test sets is to enable researchers to test dependency parsing models on several different data sets from different text genres. We discuss some easy to compute statistics to demonstrate the variation and differences in the test sets and provide some baseline experiments where we test the effect of additional lexical knowledge on the out-of-domain performance of two state-of-the-art dependency parsers. Finally, we demonstrate with three small experiments that text normalization may be an important step in the standard processing pipeline when applied in an out-of-domain setting.

pdf abs
UnixMan Corpus: A Resource for Language Learning in the Unix Domain
Kyle Richardson | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a new resource, the UnixMan Corpus, for studying language learning it the domain of Unix utility manuals. The corpus is built by mining Unix (and other Unix related) man pages for parallel example entries, consisting of English textual descriptions with corresponding command examples. The commands provide a grounded and ambiguous semantics for the textual descriptions, making the corpus of interest to work on Semantic Parsing and Grounded Language Learning. In contrast to standard resources for Semantic Parsing, which tend to be restricted to a small number of concepts and relations, the UnixMan Corpus spans a wide variety of utility genres and topics, and consists of hundreds of command and domain entity types. The semi-structured nature of the manuals also makes it easy to exploit other types of relevant information for Grounded Language Learning. We describe the details of the corpus and provide preliminary classification results.

pdf abs
Exploring the utility of coreference chains for improved identification of personal names
Andrea Glaser | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Identifying the real world entity that a proper name refers to is an important task in many NLP applications. Context plays an important role in disambiguating entities with the same names. In this paper, we discuss a dataset and experimental set-up that allows us to systematically explore the effects of different sizes and types of context in this disambiguation task. We create context by first identifying coreferent expressions in the document and then combining sentences these expressions occur in to one informative context. We apply different filters to obtain different levels of coreference-based context. Since hand-labeling a dataset of a decent size is expensive, we investigate the usefulness of an automatically created pseudo-ambiguity dataset. The results on this pseudo-ambiguity dataset show that using coreference-based context performs better than using a fixed window of context around the entity. The insights taken from the pseudo data experiments can be used to predict how the method works with real data. In our experiments on real data we obtain comparable results.

pdf
A Graphical Interface for Automatic Error Mining in Corpora
Gregor Thiele | Wolfgang Seeker | Markus Gärtner | Anders Björkelund | Jonas Kuhn
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf
Towards a Tool for Interactive Concept Building for Large Scale Analysis in the Humanities
Andre Blessing | Jonathan Sonntag | Fritz Kliche | Ulrich Heid | Jonas Kuhn | Manfred Stede
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf
Towards Joint Morphological Analysis and Dependency Parsing of Turkish
Özlem Çetinoğlu | Jonas Kuhn
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

pdf
The Effects of Syntactic Features in Automatic Prediction of Morphology
Wolfgang Seeker | Jonas Kuhn
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf
Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
Wiltrud Kessler | Jonas Kuhn
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf
Morphological and Syntactic Case in Statistical Dependency Parsing
Wolfgang Seeker | Jonas Kuhn
Computational Linguistics, Volume 39, Issue 1 - March 2013

pdf
Combining Referring Expression Generation and Surface Realization: A Corpus-Based Investigation of Architectures
Sina Zarrieß | Jonas Kuhn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
ICARUS – An Extensible Graphical Search Tool for Dependency Treebanks
Markus Gärtner | Gregor Thiele | Wolfgang Seeker | Anders Björkelund | Jonas Kuhn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf
Comparing Non-projective Strategies for Labeled Graph-Based Dependency Parsing
Anders Björkelund | Jonas Kuhn
Proceedings of COLING 2012: Posters

pdf
Phrase Structures and Dependencies for End-to-End Coreference Resolution
Anders Björkelund | Jonas Kuhn
Proceedings of COLING 2012: Posters

pdf
Light Textual Inference for Semantic Parsing
Kyle Richardson | Jonas Kuhn
Proceedings of COLING 2012: Posters

pdf
Data-driven Dependency Parsing With Empty Heads
Wolfgang Seeker | Richárd Farkas | Bernd Bohnet | Helmut Schmid | Jonas Kuhn
Proceedings of COLING 2012: Posters

pdf abs
Making Ellipses Explicit in Dependency Conversion for a German Treebank
Wolfgang Seeker | Jonas Kuhn
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a carefully designed dependency conversion of the German phrase-structure treebank TiGer that explicitly represents verb ellipses by introducing empty nodes into the tree. Although the conversion process uses heuristics like many other conversion tools we designed them to fail if no reasonable solution can be found. The failing of the conversion process makes it possible to detect elliptical constructions where the head is missing, but it also allows us to find errors in the original annotation. We discuss the conversion process and the heuristics, and describe some design decisions and error corrections that we applied to the corpus. Since most of today's data-driven dependency parsers are not able to handle empty nodes directly during parsing, our conversion tool also derives a canonical dependency format without empty nodes. It is shown experimentally to be well suited for training statistical dependency parsers by comparing the performance of two parsers from different parsing paradigms on the data set of the CoNLL 2009 Shared Task data and our corpus.

pdf abs
A Corpus-based Study of the German Recipient Passive
Patrick Ziering | Sina Zarrieß | Jonas Kuhn
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we investigate the usage of a non-canonical German passive alternation for ditransitive verbs, the recipient passive, in naturally occuring corpus data. We propose a classifier that predicts the voice of a ditransitive verb based on the contextually determined properties its arguments. As the recipient passive is a low frequent phenomenon, we first create a special data set focussing on German ditransitive verbs which are frequently used in the recipient passive. We use a broad-coverage grammar-based parser, the German LFG parser, to automatically annotate our data set for the morpho-syntactic properties of the involved predicate arguments. We train a Maximum Entropy classifier on the automatically annotated sentences and achieve an accuracy of 98.05%, clearly outperforming the baseline that always predicts active voice baseline (94.6%).

pdf
Generating Non-Projective Word Order in Statistical Linearization
Bernd Bohnet | Anders Björkelund | Jonas Kuhn | Wolfgang Seeker | Sina Zarriess
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf
The Best of Both Worlds – A Graph-based Completion Model for Transition-based Parsers
Bernd Bohnet | Jonas Kuhn
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf
To what extent does sentence-internal realisation reflect discourse context? A study on word order
Sina Zarrieß | Aoife Cahill | Jonas Kuhn
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

pdf
On the Role of Explicit Morphological Feature Representation in Syntactic Dependency Parsing for German
Wolfgang Seeker | Jonas Kuhn
Proceedings of the 12th International Conference on Parsing Technologies

pdf
Underspecifying and Predicting Voice for Surface Realisation Ranking
Sina Zarrieß | Aoife Cahill | Jonas Kuhn
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf abs
Towards a Large Parallel Corpus of Cleft Constructions
Gerlof Bouma | Lilja Øvrelid | Jonas Kuhn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present our efforts to create a large-scale, semi-automatically annotated parallel corpus of cleft constructions. The corpus is intended to reduce or make more effective the manual task of finding examples of clefts in a corpus. The corpus is being developed in the context of the Collaborative Research Centre SFB 632, which is a large, interdisciplinary research initiative to study information structure, at the University of Potsdam and the Humboldt University in Berlin. The corpus is based on the Europarl corpus (version 3). We show how state-of-the-art NLP tools, like POS taggers and statistical dependency parsers, may facilitate powerful and precise searches. We argue that identifying clefts using automatically added syntactic structure annotation is ultimately to be preferred over using lower level, though more robust, extraction methods like regular expression matching. An evaluation of the extraction method for one of the languages also offers some support for this method. We end the paper by discussing the resulting corpus itself. We present some examples of interesting clefts and translational counterparts from the corpus and suggest ways of exploiting our newly created resource in the cross-linguistic study of clefts.

pdf abs
Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
Cheikh M. Bamba Dione | Jonas Kuhn | Sina Zarrieß
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we report on the design of a part-of-speech-tagset for Wolof and on the creation of a semi-automatically annotated gold standard. In order to achieve high-quality annotation relatively fast, we first generated an accurate lexicon that draws on existing word and name lists and takes into account inflectional and derivational morphology. The main motivation for the tagged corpus is to obtain data for training automatic taggers with machine learning approaches. Hence, we took machine learning considerations into account during tagset design and we present training experiments as part of this paper. The best automatic tagger achieves an accuracy of 95.2% in cross-validation experiments. We also wanted to create a basis for experimenting with annotation projection techniques, which exploit parallel corpora. For this reason, it was useful to use a part of the Bible as the gold standard corpus, for which sentence-aligned parallel versions in many languages are easy to obtain. We also report on preliminary experiments exploiting a statistical word alignment of the parallel text.

pdf abs
Training Parsers on Partial Trees: A Cross-language Comparison
Kathrin Spreyer | Lilja Øvrelid | Jonas Kuhn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a study that compares data-driven dependency parsers obtained by means of annotation projection between language pairs of varying structural similarity. We show how the partial dependency trees projected from English to Dutch, Italian and German can be exploited to train parsers for the target languages. We evaluate the parsers against manual gold standard annotations and find that the projected parsers substantially outperform our heuristic baselines by 9―25% UAS, which corresponds to a 21―43% reduction in error rate. A comparative error analysis focuses on how the projected target language parsers handle subjects, which is especially interesting for Italian as an instance of a pro-drop language. For Dutch, we further present experiments with German as an alternative source language. In both source languages, we contrast standard baseline parsers with parsers that are enhanced with the predictions from large-scale LFG grammars through a technique of parser stacking, and show that improvements of the source language parser can directly lead to similar improvements of the projected target language parser.

pdf
Hard Constraints for Grammatical Function Labelling
Wolfgang Seeker | Ines Rehbein | Jonas Kuhn | Josef van Genabith
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf
A Cross-Lingual Induction Technique for German Adverbial Participles
Sina Zarrieß | Aoife Cahill | Jonas Kuhn | Christian Rohrer
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

pdf
Informed ways of improving data-driven dependency parsing for German
Wolfgang Seeker | Bernd Bohnet | Lilja Øvrelid | Jonas Kuhn
Coling 2010: Posters

pdf
Cross-Lingual Induction for Deep Broad-Coverage Syntax: A Case Study on German Participles
Sina Zarrieß | Aoife Cahill | Jonas Kuhn | Christian Rohrer
Coling 2010: Posters

2009

pdf
Improving data-driven dependency parsing using large-scale LFG grammars
Lilja Øvrelid | Jonas Kuhn | Kathrin Spreyer
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf
Cross-framework parser stacking for data-driven dependency parsing
Lilja Øvrelid | Jonas Kuhn | Kathrin Spreyer
Traitement Automatique des Langues, Volume 50, Numéro 3 : Apprentissage automatique pour le TAL [Machine Learning for NLP]

pdf
Data-Driven Dependency Parsing of New Languages Using Incomplete and Noisy Training Data
Kathrin Spreyer | Jonas Kuhn
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

pdf
Empirical Lower Bounds on Aligment Error Rates in Syntax-Based Machine Translation
Anders Søgaard | Jonas Kuhn
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

pdf
Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß | Jonas Kuhn
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

pdf
Using a maximum entropy-based tagger to improve a very fast vine parser
Anders Søgaard | Jonas Kuhn
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf abs
Identification of Comparable Argument-Head Relations in Parallel Corpora
Kathrin Spreyer | Jonas Kuhn | Bettina Schrader
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present the machine learning framework that we are developing, in order to support explorative search for non-trivial linguistic configurations in low-density languages (languages with no or few NLP tools). The approach exploits advanced existing analysis tools for high-density languages and word-aligned multi-parallel corpora to bridge across languages. The goal is to find a methodology that minimizes the amount of human expert intervention needed, while producing high-quality search and annotation tools. One of the main challenges is the susceptibility of a complex system combining various automatic analysis components to hard-to-control noise from a number of sources. We present systematic experiments investigating to what degree the noise issue can be overcome by (i) exploiting more than one perspective on the target language data by considering multiple translations in the parallel corpus, and (ii) using minimally supervised learning techniques such as co-training and self-training to take advantage of a larger pool of data for generalization. We observe that while (i) does help in the training individual machine learning models, a cyclic bootstrapping process seems to suffer too much from noise. A preliminary conclusion is that in a practical approach, one has to rely on a higher degree of supervision or on noise detection heuristics.

2007

pdf
Machine Translation as Tree Labeling
Mark Hopkins | Jonas Kuhn
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

pdf
Deep Grammars in a Tree Labeling Approach to Syntax-based Statistical Machine Translation
Mark Hopkins | Jonas Kuhn
ACL 2007 Workshop on Deep Linguistic Processing

2006

pdf abs
Multilingual parallel treebanking: a lean and flexible approach
Jonas Kuhn | Michael Jellinghaus
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We propose a bootstrapping approach to creating a phrase-level alignment over a sentence-aligned parallel corpus, reporting concrete treebank annotation work performed on a sample of sentence tuples from the Europarl corpus, currently for English, French, German, and Spanish. The manually annotated seed data will be used as the basis for automatically labelling the rest of the corpus. Some preliminary experiments addressing the bootstrapping aspects are presented. The representation format for syntactic correspondence across parallel text that we propose as the starting point for a process of successive refinement emphasizes correspondences of major constituents that realize semantic arguments or modifiers; language-particular details of morphosyntactic realization are intentionally left largely unlabelled. We believe this format is a good basis for training NLPtools for multilingual application contexts in which consistency across languages is more central than fine-grained details in specific languages (in particular, syntax-based statistical Machine Translation).

pdf
Exploring the Potential of Intractable Parsers
Mark Hopkins | Jonas Kuhn
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
A Framework for Incorporating Alignment Information in Parsing
Mark Hopkins | Jonas Kuhn
Proceedings of the Cross-Language Knowledge Induction Workshop

2005

pdf
Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
Jonas Kuhn
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

pdf
Experiments in parallel-text based grammar induction
Jonas Kuhn
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf abs
Utilization of Multiple Language Resources for Robust Grammar-Based Tense and Aspect Classification
Alexis Palmer | Jonas Kuhn | Carlota Smith
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper reports on an ongoing project that uses varied language resources and advanced NLP tools for a linguistic classification task in discourse semantics. The system we present is designed to assign a "situation entity" class label to each predicator in English text. The project goal is to achieve the best-possible identification of situation entities in naturally-occurring written texts by implementing a robust system that will deal with real corpus material, rather than just with constructed textbook examples of discourse. In this paper we focus on the combination of multiple information sources, which we see as being vital for a robust classification system. We use a deep syntactic grammar of English to identify morphological, syntactic, and discourse clues, and we use various lexical databases for fine-grained semantic properties of the predicators. Experiments performed to date show that enhancing the output of the grammar with information from lexical resources improves recall but lowers precision in the situation entity classification task.

pdf abs
Applying Computational Linguistic Techniques in a Documentary Project for Q’anjob’al (Mayan, Guatemala)
Jonas Kuhn | B’alam Mateo-Toledo
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper reports on a number of experiments in which we applied standard techniques from NLP in the context of documentation of endangered languages. We concentrated on the use of existing, freely available toolkits. Specifically, we explore the use of Finite-State Morphological Analysis, Maximum Entropy Part-of-Speech Tagging, and N-Gram Language Modeling.