Boris Haselbach


2012

Approximating Theoretical Linguistics Classification in Real Data: the Case of German “nach” Particle Verbs
Boris Haselbach | Kerstin Eckart | Wolfgang Seeker | Kurt Eberle | Ulrich Heid
Proceedings of COLING 2012

German nach-Particle Verbs in Semantic Theory and Corpus Data
Boris Haselbach | Wolfgang Seeker | Kerstin Eckart
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present a database-supported corpus study in which we combine automatically obtained linguistic information from a statistical dependency parser, namely the occurrence of a dative argument, with predictions from a theory of the argument structure of German particle verbs with “nach”. The theory predicts five readings of “nach” which behave differently with respect to dative licensing in their argument structure. From a huge German web corpus, we extracted sentences for a subset of “nach”-particle verbs for which no dative is expected by the theory. Making use of a relational database management system, we bring together the corpus sentences and the lemmas manually annotated along the lines of the theory. We validate the theoretical predictions against the syntactic structure of the corpus sentences, which we obtained from a statistical dependency parser. We find that, in principle, the theory is borne out by the data; however, manual error analysis reveals cases for which the theory needs to be extended.
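The validation step described in this abstract can be pictured as a relational join between manually annotated lemmas and parsed corpus sentences. The following is only a minimal sketch using Python's built-in `sqlite3` module; the table names, column names, reading labels, and toy rows are all assumptions made for illustration and are not taken from the paper, whose actual schema and database system are not described here.

```python
import sqlite3

# Hypothetical schema: none of these table or column names come from the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lemma_annotations (
        lemma TEXT PRIMARY KEY,
        reading TEXT,              -- placeholder label for the annotated reading
        dative_expected INTEGER    -- 1 if the theory licenses a dative, else 0
    );
    CREATE TABLE sentences (
        id INTEGER PRIMARY KEY,
        lemma TEXT,
        text TEXT,
        has_dative INTEGER         -- 1 if the parser found a dative argument
    );
""")

# Invented toy rows with placeholder lemmas; not real corpus data.
conn.executemany(
    "INSERT INTO lemma_annotations VALUES (?, ?, ?)",
    [("verb_a", "reading_1", 0), ("verb_b", "reading_2", 1)],
)
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    [
        (1, "verb_a", "toy sentence 1", 0),
        (2, "verb_a", "toy sentence 2", 1),
        (3, "verb_b", "toy sentence 3", 1),
    ],
)


def find_unexpected_datives(connection):
    """Return (lemma, text) pairs where the parser reports a dative
    although the annotated reading predicts none."""
    query = """
        SELECT s.lemma, s.text
        FROM sentences AS s
        JOIN lemma_annotations AS a ON a.lemma = s.lemma
        WHERE a.dative_expected = 0 AND s.has_dative = 1
        ORDER BY s.id
    """
    return connection.execute(query).fetchall()


candidates = find_unexpected_datives(conn)
```

Rows returned by such a query would be the candidates for manual error analysis: each one is either a parser error or a case the theory does not yet cover.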

A Tool/Database Interface for Multi-Level Analyses
Kurt Eberle | Kerstin Eckart | Ulrich Heid | Boris Haselbach
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Depending on the nature of a linguistic theory, empirical investigations of its soundness may focus on corpus studies related to lexical, syntactic, semantic or other phenomena. Work in research networks in particular usually comprises analyses at different levels of description, each of which must be as reliable as possible when the same sentences and texts are investigated from very different perspectives. This paper describes an infrastructure that interfaces an analysis tool for multi-level annotation with a generic relational database. It supports three dimensions of analysis handling and thereby builds an integrated environment for quality assurance in corpus-based linguistic analysis: a vertical dimension relating analysis components in a pipeline, a horizontal dimension taking alternative results of the same analysis level into account, and a temporal dimension to follow up cases where analyses for the same input have been produced with different versions of a tool. As an example, we give a detailed description of a typical workflow for the vertical dimension.

2010

The Development of a Morphosyntactic Tagset for Afrikaans and its Use with Statistical Tagging
Boris Haselbach | Ulrich Heid
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a morphosyntactic tagset for Afrikaans based on the guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES). We compare our slim yet expressive tagset, MAATS (Morphosyntactic AfrikAans TagSet), with an existing one which primarily focuses on a detailed morphosyntactic and semantic description of word forms. MAATS will primarily be used for the extraction of lexical data from large POS-tagged corpora. We focus not only on morphosyntactic properties but also on processability with statistical tagging. We discuss the tagset design and motivate our classification of Afrikaans word forms; in particular, we focus on the categorization of verbs and conjunctions. The complete tagset is presented and we briefly discuss each word class. In a case study with an Afrikaans newspaper corpus, we evaluate our tagset with four different statistical taggers. With a relatively small amount of training data but a large tagger lexicon, TnT-Tagger scores 97.05% accuracy. Additionally, we present some error sources and discuss future work.