Ulrich Heid

2019

Detecting Paraphrases of Standard Clause Titles in Insurance Contracts
Frieda Josi | Christian Wartena | Ulrich Heid
RELATIONS - Workshop on meaning relations between phrases and sentences

For the analysis of contract texts, validated model texts, such as model clauses, can be used to identify reused contract clauses. This paper investigates how to calculate the similarity between titles of model clauses and headings extracted from contracts, and which similarity measure is most suitable for this. To calculate the similarity between title pairs, we tested several variants of string similarity and token-based similarity. We also compare two more semantic similarity measures based on word embeddings, one using pretrained embeddings and one using embeddings trained on contract texts. The identification of the model clause title can be used as a starting point for the mapping of clauses found in contracts to verified clauses.
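
As a rough illustration of the string-based and token-based measures mentioned in this abstract, the Python sketch below compares a contract heading with candidate model clause titles. The function names, the example titles, and the choice of Jaccard overlap and difflib's ratio are assumptions made for illustration; the abstract does not say which variants the authors tested.

```python
from difflib import SequenceMatcher


def token_jaccard(a: str, b: str) -> float:
    """Token-based similarity: shared lowercase tokens over all tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def string_ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(heading: str, model_titles: list[str]) -> tuple[str, float]:
    """Return the model clause title with the highest token similarity."""
    scored = [(title, token_jaccard(heading, title)) for title in model_titles]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical example data, not taken from the paper.
titles = ["Premium payment", "Scope of insurance cover"]
print(best_match("Scope of Cover", titles))
```

Swapping `token_jaccard` for `string_ratio` inside `best_match` is enough to compare how the two families of measures rank the same candidates.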

2017

Creating a gold standard corpus for terminological annotation from online forum data
Anna Hätty | Simon Tannert | Ulrich Heid
Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017)

2016

Acquisition of semantic relations between terms: how far can we get with standard NLP tools?
Ina Roesiger | Julia Bettinger | Johannes Schäfer | Michael Dorna | Ulrich Heid
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)

The extraction of data exemplifying relations between terms can make use, at least to a large extent, of techniques that are similar to those used in standard hybrid term candidate extraction, namely basic corpus analysis tools (e.g. tagging, lemmatization, parsing), as well as morphological analysis of complex words (compounds and derived items). In this article, we discuss the use of such techniques for the extraction of raw material for a description of relations between terms, and we provide internal evaluation data for the devices developed. We claim that user-generated content is a rich source of term variation through paraphrasing and reformulation, and that these provide relational data at the same time as term variants. Germanic languages with their rich word formation morphology may be particularly good candidates for the approach advocated here.

A Lexical Resource for the Identification of “Weak Words” in German Specification Documents
Jennifer Krisch | Melanie Dick | Ronny Jauch | Ulrich Heid
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We report on the creation of a lexical resource for the identification of potentially unspecific or imprecise constructions in German requirements documentation from the car manufacturing industry. In requirements engineering, such expressions are called “weak words”: they are not sufficiently precise to ensure an unambiguous interpretation by the contractual partners, who, for the definition of their cooperation, typically rely on specification documents (Melchisedech, 2000). Examples are dimension adjectives such as kurz or lang (‘short’, ‘long’), which need to be modified by adverbials indicating the exact duration, size, etc. Contrary to standard practice in requirements engineering, where the identification of such weak words is merely based on stopword lists, we identify weak uses in context by querying annotated text. The queries are part of the resource, as they define the conditions under which a word use is weak. We evaluate the recognition of weak uses on our development corpus and on an unseen evaluation corpus, reaching stable F1-scores above 0.95.

2014

Combining bilingual terminology mining and morphological modeling for domain adaptation in SMT
Marion Weller | Alexander Fraser | Ulrich Heid
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

The eIdentity Text Exploration Workbench
Fritz Kliche | André Blessing | Ulrich Heid | Jonathan Sonntag
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We work on tools to explore text contents and metadata of newspaper articles as provided by news archives. Our tool components are being integrated into an “Exploration Workbench” for Digital Humanities researchers. Next to the conversion of different data formats and character encodings, a prominent feature of our design is its “Wizard” function for corpus building: Researchers import raw data and define patterns to extract text contents and metadata. The Workbench also comprises different tools for data cleaning. These include filtering of off-topic articles, duplicates and near-duplicates, corrupted and empty articles. We currently work on ca. 860,000 newspaper articles from different media archives, provided in different data formats. We index the data with state-of-the-art systems to allow for large scale information retrieval. We extract metadata on publishing dates, author names, newspaper sections, etc., and split articles into segments such as headlines, subtitles, paragraphs, etc. After cleaning the data and compiling a thematically homogeneous corpus, the sample can be used for quantitative analyses which are not affected by noise. Users can retrieve sets of articles on different topics, issues or otherwise defined research questions (“subcorpora”) and investigate quantitatively their media attention on the timeline (“Issue Cycles”).
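
The abstract does not say how duplicates and near-duplicates are detected. One common approach, sketched here in Python purely as an assumption, compares word shingles with Jaccard overlap and keeps only the first article of each near-identical group. The function names, the shingle size `k`, and the threshold are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard overlap of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(articles, threshold=0.8, k=3):
    """Keep the first article of every group of near-identical texts."""
    kept, kept_shingles = [], []
    for text in articles:
        if not text.strip():
            continue  # drop empty articles
        current = shingles(text, k)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of an article already kept
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

The pairwise comparison is quadratic in the number of articles, so a collection of the size mentioned above would need an indexing scheme such as MinHash on top of this idea.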

Adapting a part-of-speech tagset to non-standard text: The case of STTS
Heike Zinsmeister | Ulrich Heid | Kathrin Beck
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Stuttgart-Tübingen TagSet (STTS) is a de-facto standard for the part-of-speech tagging of German texts. Since its first publication in 1995, STTS has been used in a variety of annotation projects, some of which have adapted the tagset slightly for their specific needs. Recently, the focus of many projects has shifted from the analysis of newspaper text to that of non-standard varieties such as user-generated content, historical texts, and learner language. These text types contain linguistic phenomena that are missing from or are only suboptimally covered by STTS; in a community effort, German NLP researchers have therefore proposed additions to and modifications of the tagset that will handle these phenomena more appropriately. In addition, they have discussed alternative ways of tag assignment in terms of bipartite tags (stem, token) for historical texts and tripartite tags (lexicon, morphology, distribution) for learner texts. In this article, we report on this ongoing activity, addressing methodological issues and discussing selected phenomena and their treatment in the tagset adaptation process.

2013

Towards a Tool for Interactive Concept Building for Large Scale Analysis in the Humanities
Andre Blessing | Jonathan Sonntag | Fritz Kliche | Ulrich Heid | Jonas Kuhn | Manfred Stede
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Using a rich feature set for the identification of German MWEs
Fabienne Cap | Marion Weller | Ulrich Heid
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technologies

2012

French and German Corpora for Audience-based Text Type Classification
Amalia Todirascu | Sebastian Padó | Jennifer Krisch | Max Kisselew | Ulrich Heid
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents some of the results of the CLASSYN project, which investigated the classification of text according to audience-related text types. We describe the design principles and the properties of the French and German linguistically annotated corpora that we have created. We report on tools used to collect the data and on the quality of the syntactic annotation. The CLASSYN corpora comprise two text collections for investigating general text-type differences between scientific and popular-science text in two domains, medicine and computer science.

Adapting and evaluating a generic term extraction tool
Anita Gojun | Ulrich Heid | Bernd Weißbach | Carola Loth | Insa Mingers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present techniques for monolingual term candidate extraction which are being developed in the EU project TTC. We designed an application for German and English data that serves as a first evaluation of the methods for terminology extraction used in the project. The application situation highlighted the need for tools to handle lemmatization errors and to remove incomplete word sequences from multi-word term candidate lists, as well as the fact that the provision of German citation forms requires more morphological knowledge than TTC's slim approach can provide. We show a detailed evaluation of our extraction results and discuss the method for the evaluation of terminology extraction systems.

Analyzing and Aligning German compound nouns
Marion Weller | Ulrich Heid
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present and evaluate an approach for the compositional alignment of compound nouns using comparable corpora from technical domains. The task of term alignment consists in relating a source language term to its translation in a list of target language terms with the help of a bilingual dictionary. Compound splitting makes it possible to transform a compound into a sequence of components which can be translated separately and then related to multi-word target language terms. We present and evaluate a method for compound splitting, and compare two strategies for term alignment (bag-of-word vs. pattern-based). The simple word-based approach leads to a considerable amount of erroneous alignments, whereas the pattern-based approach reaches a decent precision. We also assess the reasons for alignment failures: in the comparable corpora used for our experiments, a substantial number of terms have no translation in the target language data; furthermore, the non-isomorphic structures of source and target language terms cause alignment failures in many cases.
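
The two steps named in this abstract, compound splitting followed by bag-of-words alignment, can be sketched in Python as follows. This is a toy illustration, not the authors' method: the lexicon, the list of linking elements, the one-translation-per-component dictionary, and the function names are all hypothetical.

```python
def split_compound(word, lexicon, linkers=("", "s", "n", "en", "es")):
    """Split a compound into known components, or return None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try split points from right to left, so longer left parts come first.
    for i in range(len(word) - 2, 2, -1):
        head = word[:i]
        for link in linkers:
            # Strip an optional linking element from the left-hand part.
            stem = head[: len(head) - len(link)] if link else head
            if head.endswith(link) and stem in lexicon:
                rest = split_compound(word[i:], lexicon, linkers)
                if rest:
                    return [stem] + rest
    return None


def align(components, bilingual, target_terms):
    """Bag-of-words alignment: target terms made up of the translations."""
    translated = {bilingual[c] for c in components if c in bilingual}
    return [t for t in target_terms if set(t.lower().split()) == translated]


# Hypothetical toy data.
lexicon = {"wind", "energie", "anlage"}
bilingual = {"wind": "wind", "energie": "energy", "anlage": "plant"}
parts = split_compound("Windenergieanlage", lexicon)
print(parts, align(parts, bilingual, ["wind energy plant", "solar plant"]))
```

Because the bag-of-words comparison ignores word order and structure, it accepts any target term built from the same translated words, which is consistent with the erroneous alignments the abstract reports for the simple word-based strategy.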

A Tool/Database Interface for Multi-Level Analyses
Kurt Eberle | Kerstin Eckart | Ulrich Heid | Boris Haselbach
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Depending on the nature of a linguistic theory, empirical investigations of its soundness may focus on corpus studies related to lexical, syntactic, semantic or other phenomena. Especially work in research networks usually comprises analyses of different levels of description, where each one must be as reliable as possible when the same sentences and texts are investigated under very different perspectives. This paper describes an infrastructure that interfaces an analysis tool for multi-level annotation with a generic relational database. It supports three dimensions of analysis-handling and thereby builds an integrated environment for quality assurance in corpus based linguistic analysis: a vertical dimension relating analysis components in a pipeline, a horizontal dimension taking alternative results of the same analysis level into account and a temporal dimension to follow up cases where analyses for the same input have been produced with different versions of a tool. As an example we give a detailed description of a typical workflow for the vertical dimension.

Approximating Theoretical Linguistics Classification in Real Data: the Case of German “nach” Particle Verbs
Boris Haselbach | Kerstin Eckart | Wolfgang Seeker | Kurt Eberle | Ulrich Heid
Proceedings of COLING 2012

2010

The Development of a Morphosyntactic Tagset for Afrikaans and its Use with Statistical Tagging
Boris Haselbach | Ulrich Heid
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a morphosyntactic tagset for Afrikaans based on the guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES). We compare our slim yet expressive tagset, MAATS (Morphosyntactic AfrikAans TagSet), with an existing one which primarily focuses on a detailed morphosyntactic and semantic description of word forms. MAATS will primarily be used for the extraction of lexical data from large pos-tagged corpora. We not only focus on morphosyntactic properties but also on the processability with statistical tagging. We discuss the tagset design and motivate our classification of Afrikaans word forms; in particular, we focus on the categorization of verbs and conjunctions. The complete tagset is presented and we briefly discuss each word class. In a case study with an Afrikaans newspaper corpus, we evaluate our tagset with four different statistical taggers. With a relatively small amount of training data but a large tagger lexicon, TnT-Tagger scores 97.05 % accuracy. Additionally, we present some error sources and discuss future work.

Term and Collocation Extraction by Means of Complex Linguistic Web Services
Ulrich Heid | Fabienne Fritzinger | Erhard Hinrichs | Marie Hinrichs | Thomas Zastrow
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a web service-based environment for the use of linguistic resources and tools to address issues of terminology and language varieties. We discuss the architecture, corpus representation formats, components and a chainer supporting the combination of tools into task-specific services. Integrated into this environment, single web services also become part of complex scenarios for web service use. Our web services take for example corpora of several million words as an input on which they perform preprocessing, such as tokenisation, tagging, lemmatisation and parsing, and corpus exploration, such as collocation extraction and corpus comparison. Here we present an example on extraction of single and multiword items typical of a specific domain or typical of a regional variety of German. We also give a critical review on needs and available functions from a user's point of view. The work presented here is part of ongoing experimentation in the D-SPIN project, the German national counterpart of CLARIN.

Design and Application of a Gold Standard for Morphological Analysis: SMOR as an Example of Morphological Evaluation
Gertrud Faaß | Ulrich Heid | Helmut Schmid
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes general requirements for evaluating and documenting NLP tools with a focus on morphological analysers and the design of a Gold Standard. It is argued that any evaluation must be measurable and documentation thereof must be made accessible for any user of the tool. The documentation must enable the user to compare different tools offering the same service; hence the descriptions must contain measurable values. A Gold Standard is a vital part of any measurable evaluation process; therefore, the corpus-based design of a Gold Standard, its creation and problems that occur are reported upon here. Our project concentrates on SMOR, a morphological analyser for German that is to be offered as a web-service. We not only utilize this analyser for designing the Gold Standard, but also evaluate the tool itself at the same time. Note that the project is ongoing; therefore, we cannot present final results.

Extraction of German Multiword Expressions from Parsed Corpora Using Context Features
Marion Weller | Ulrich Heid
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We report on tools for the extraction of German multiword expressions (MWEs) from text corpora; we extract word pairs, but also longer MWEs of different patterns, e.g. verb-noun structures with an additional prepositional phrase or adjective. Next to standard association-based extraction, we focus on morpho-syntactic, syntactic and lexical-choice features of the MWE candidates. A broad range of such properties (e.g. number and definiteness of nouns, adjacency of the MWEs' components and their position in the sentence, preferred lexical modifiers, etc.), along with relevant example sentences, are extracted from dependency-parsed text and stored in a database. A sample precision evaluation and an analysis of extraction errors are provided along with the discussion of our extraction architecture. We furthermore measure the contribution of the features to the precision of the extraction: by using both morpho-syntactic and syntactic features, we achieve a higher precision in the identification of idiomatic MWEs than by using only properties of one type.

Building a Cross-lingual Relatedness Thesaurus using a Graph Similarity Measure
Lukas Michelbacher | Florian Laws | Beate Dorow | Ulrich Heid | Hinrich Schütze
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Internet is an ever growing source of information stored in documents of different languages. Hence, cross-lingual resources are needed for more and more NLP applications. This paper presents (i) a graph-based method for creating one such resource and (ii) a resource created using the method, a cross-lingual relatedness thesaurus. Given a word in one language, the thesaurus suggests words in a second language that are semantically related. The method requires two monolingual corpora and a basic dictionary. Our general approach is to build two monolingual word graphs, with nodes representing words and edges representing linguistic relations between words. A bilingual dictionary containing basic vocabulary provides seed translations relating nodes from both graphs. We then use an inter-graph node-similarity algorithm to discover related words. Evaluation with three human judges revealed that 49% of the English and 57% of the German words discovered by our method are semantically related to the target words. We publish two resources in conjunction with this paper. First, noun coordinations extracted from the German and English Wikipedias. Second, the cross-lingual relatedness thesaurus which can be used in experiments involving interactive cross-lingual query expansion.

A Corpus Representation Format for Linguistic Web Services: The D-SPIN Text Corpus Format and its Relationship with ISO Standards
Ulrich Heid | Helmut Schmid | Kerstin Eckart | Erhard Hinrichs
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In the framework of the preparation of linguistic web services for corpus processing, the need was felt for a representation format which supports interoperability between different web services in a corpus processing pipeline, but also provides a well-defined interface to both legacy tools and their data formats and upcoming international standards. We present the D-SPIN text corpus format, TCF, which was designed for this purpose. It is a stand-off XML format, inspired by the philosophy of the emerging standards LAF (Linguistic Annotation Framework) and its “instances” MAF for morpho-syntactic annotation and SynAF for syntactic annotation. Tools for the exchange with existing (best practice) formats are available, and a converter from MAF to TCF is being tested in spring 2010. We describe the usage scenario where TCF is embedded and the properties and architecture of TCF. We also give examples of TCF encoded data and describe the aspects of syntactic and semantic interoperability already addressed.

A Survey of Idiomatic Preposition-Noun-Verb Triples on Token Level
Fabienne Fritzinger | Marion Weller | Ulrich Heid
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Most of the research on the extraction of idiomatic multiword expressions (MWEs) focused on the acquisition of MWE types. In the present work we investigate whether a text instance of a potentially idiomatic MWE is actually used idiomatically in a given context or not. Inspired by the dataset provided by Cook et al. (2008), we manually analysed 9,700 instances of potentially idiomatic preposition-noun-verb triples (a frequent pattern among German MWEs) to identify, on token level, idiomatic vs. literal uses. In our dataset, all sentences are provided along with their morpho-syntactic properties. We describe our data extraction and annotation steps, and we discuss quantitative results from both EUROPARL and a German newspaper corpus. We discuss the relationship between idiomaticity and morpho-syntactic fixedness, and we address issues of ambiguity between literal and idiomatic use of MWEs. Our data show that EUROPARL is particularly well suited for MWE extraction, as most MWEs in this corpus are indeed used only idiomatically.

2009

Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words
Gertrud Faaß | Ulrich Heid | Elsabé Taljard | Danie Prinsloo
Proceedings of the First Workshop on Language Technologies for African Languages

2008

Formalising Multi-layer Corpora in OWL DL - Lexicon Modelling, Querying and Consistency Control
Aljoscha Burchardt | Sebastian Padó | Dennis Spohr | Anette Frank | Ulrich Heid
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
Kremena Ivanova | Ulrich Heid | Sabine Schulte im Walde | Adam Kilgarriff | Jan Pomikálek
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Word sketches are part of the Sketch Engine corpus query system. They represent automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Besides the corpus itself, word sketches require a sketch grammar, a regular expression-based shallow grammar over the part-of-speech tags, to extract evidence for the properties of the targeted words from the corpus. The paper presents a sketch grammar for German, a language which is not strictly configurational and which shows a considerable amount of case syncretism, and evaluates its accuracy, which has not been done for other sketch grammars. The evaluation focuses on NP case as a crucial part of the German grammar. We present various versions of NP definitions, thus demonstrating the influence of grammar detail on precision and recall.

Tools for Collocation Extraction: Preferences for Active vs. Passive
Ulrich Heid | Marion Weller
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present and partially evaluate procedures for the extraction of noun+verb collocation candidates from German text corpora, along with their morphosyntactic preferences, especially for the active vs. passive voice. We start from tokenized, tagged, lemmatized and chunked text, and we use extraction patterns formulated in the CQP corpus query language. We discuss the results of a precision evaluation, on administrative texts from the European Union: we find a considerable amount of specialized collocations, as well as general ones and complex predicates; overall the precision is considerably higher than that of a statistical extractor used as a baseline.

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-independent statistical techniques followed by linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by systemic functional grammar and proposing a three-level analysis with lexical, functional and semantic criteria. From a tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties that help to filter the output of the statistical methods and to extract potentially interesting VN constructions (complex predicates vs complex predicators). The extracted candidates are validated and classified manually.

Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.
Ekaterina Lapshinova-Koltunski | Ulrich Heid
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we discuss an approach to the semi-automatic extraction and classification of compounds from German corpora. Compound nominals are semi-automatically extracted from text corpora along with their sentential complements. In this study we concentrate on that, wh or if subclauses, although our methods can be applied to other complements as well. We elaborate an architecture using linguistic knowledge about the phenomena we extract, and aim at answering the following questions: how can data about subcategorisation properties of nominal compounds be extracted from text corpora, and how can compounds be classified according to their subcategorisation properties? Our classification is based on the relationships between the subcategorisation of nominal compounds, e.g. Grundfrage, Wettstreit and Beweismittel, and that of their constituent parts, such as Frage, Streit, Beweis, etc. We show that there are cases which do not match the commonly accepted assumption that the head of a compound is its valency bearer. Such cases should receive a specific treatment in NLP dictionary building. This calls for tools to identify and classify such cases by means of data extraction from corpora. We propose precision-oriented semiautomatic extraction which can operate on tokenized, tagged and lemmatized texts. In the future, we are going to extend the kinds of extracted complements beyond subclauses and analyze the nature of the non-head valency-bearer of compounds.

A LAF/GrAF based Encoding Scheme for underspecified Representations of syntactic Annotations.
Manuel Kountz | Ulrich Heid | Kerstin Eckart
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Data models and encoding formats for syntactically annotated text corpora need to deal with syntactic ambiguity; underspecified representations are particularly well suited for the representation of ambiguous data because they allow for high informational efficiency. We discuss the issue of being informationally efficient, and the trade-off between efficient encoding of linguistic annotations and complete documentation of linguistic analyses. The main topic of this article is a data model and an encoding scheme based on LAF/GrAF (Ide and Romary, 2006; Ide and Suderman, 2007) which provides a flexible framework for encoding underspecified representations. We show how a set of dependency structures and a set of TiGer graphs (Brants et al., 2002) representing the readings of an ambiguous sentence can be encoded, and we discuss basic issues in querying corpora which are encoded using the framework presented here.

2006

Proceedings of the Workshop on Multilingual Language Resources and Interoperability
Andreas Witt | Gilles Sérasset | Susan Armstrong | Jim Breen | Ulrich Heid | Felix Sasaki
Proceedings of the Workshop on Multilingual Language Resources and Interoperability

Modeling Monolingual and Bilingual Collocation Dictionaries in Description Logics
Dennis Spohr | Ulrich Heid
Proceedings of the Workshop on Multi-word-expressions in a multilingual context

Extraction tools for collocations and their morphosyntactic specificities
Julia Ritz | Ulrich Heid
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe tools for the extraction of collocations not only in the form of word combinations, but also of data about the morphosyntactic properties of collocation candidates. Such data are needed for a detailed lexical description of collocations, and to support both their recognition in text and the generation of collocationally acceptable text. We describe the tool architecture, report on a case study based on noun+verb collocations, and we give a first rough evaluation of the data quality produced.

Grammar-based tools for the creation of tagging resources for an unresourced language: the case of Northern Sotho
Ulrich Heid | Elsabé Taljard | Danie J. Prinsloo
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe an architecture for the parallel construction of a tagger lexicon and an annotated reference corpus for the part-of-speech tagging of Northern Sotho, a Bantu language of South Africa, for which no tagged resources have been available so far. Our tools make use of grammatical properties (morphological and syntactic) of the language. We use symbolic pretagging, followed by stochastic tagging, an architecture which proves useful not only for the bootstrapping of tagging resources, but also for the tagging of any new text. We discuss the tagset design, the tool architecture and the current state of our ongoing effort.

In this paper, we introduce the methodology for the construction of dictionary fragments under development in DELIS. The approach advocated is corpus-based, computationally supported, and aimed at the construction of parallel monolingual dictionary fragments which can be linked to form translation dictionaries without many problems. The parallelism of the monolingual fragments is achieved through the use of a shared inventory of descriptive devices, one common representation formalism (typed feature structures) for linguistic information from all levels, as well as a working methodology inspired by onomasiology: treating all elements of a given lexical semantic field consistently with common descriptive devices at the same time. It is claimed that such monolingual dictionaries are particularly easy to relate in a machine translation application. The principles of such a combination of dictionary fragments are illustrated with examples from an experimental HPSG-based interlingua-oriented machine translation prototype.