Thomas Zastrow

2012

pdf bib abs
Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC
Erhard Hinrichs | Thomas Zastrow
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the TuÌbingen Baumbank des Deutschen Diachron (TuÌBa-D/DC), a linguistically annotated corpus of selected diachronic materials from the German Gutenberg Project. It was automatically annotated by a suite of NLP tools integrated into WebLicht, the linguistic chaining tool used in CLARIN-D. The annotation quality has been evaluated manually for a subcorpus ranging from Middle High German to Modern High German. The integration of the TuÌBa-D/DC into the CLARIN-D infrastructure includes metadata provision and harvesting as well as sustainable data storage in the TuÌbingen CLARIN-D center. The paper further provides an overview of the possibilities of accessing the TuÌBa-D/DC data. Methods for full-text search of the metadata and object data and for annotation-based search of the object data are described in detail. The WebLicht Service Oriented Architecture is used as an integrated environment for annotation based search of the TuÌBa-D/DC. WebLicht thus not only serves as the annotation platform for the TuÌBa-D/DC, but also as a generic user interface for accessing and visualizing it.

This paper presents the system architecture as well as the underlying workflow of the Extensible Repository System of Digital Objects (ERDO) which has been developed for the sustainable archiving of language resources within the Tübingen CLARIN-D project. In contrast to other approaches focusing on archiving experts, the described workflow can be used by researchers without required knowledge in the field of long-term storage for transferring data from their local file systems into a persistent repository.

2010

pdf bib abs
Sustainability of Linguistic Data and Analysis in the Context of a Collaborative eScience Environment
Erhard Hinrichs | Verena Henrich | Thomas Zastrow
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For researchers, it is especially important that primary research data are preserved and made available on a long-term basis and to a wide variety of researchers. In order to ensure long-term availability of the archived data, it is imperative that the data to be stored is conformant with standardized data formats and best practices followed by the relevant research communities. Storing, managing, and accessing such standard-conformant data requires a repository-based infrastructure. Two projects at the University of Tübingen are realizing a collaborative eScience research environment with the help of eSciDoc for the university that supports long-term preservation of all kinds of data as well as a fine-grained and contextualized data management: the INF project and the BW-eSci(T) project. The task of the infrastructure (INF) project within the collaborative research centre âEmergence of Meaning (SFB 833) is to guarantee the long-term availability of the SFBs data. BW-eSci(T) is a joint project of the University of Tübingen and the Fachinformationszentrums (FIZ) Karlsruhe. The goal of this project is to develop a prototypical eScience research environment for the University of Tübingen.

pdf bib abs
WebLicht: Web-based LRT Services in a Distributed eScience Infrastructure
Marie Hinrichs | Thomas Zastrow | Erhard Hinrichs
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

eScience - enhanced science - is a new paradigm of scientific work and research. In the humanities, eScience environments can be helpful in establishing new workflows and lifecycles of scientific data. WebLicht is such an eScience environment for linguistic analysis, making linguistic tools and resources available network-wide. Today, most digital language resources and tools (LRT) are available by download only. This is inconvenient for someone who wants to use and combine several tools because these tools are normally not compatible with each other. To overcome this restriction, WebLicht makes the functionality of linguistic tools and the resources themselves available via the internet as web services. In WebLicht, several kinds of linguistic tools are available which cover the basic functionality of automatic and incremental creation of annotated text corpora. To make use of the more than 70 tools and resources currently available, the end user needs nothing more than just a common web browser.

pdf bib abs
Term and Collocation Extraction by Means of Complex Linguistic Web Services
Ulrich Heid | Fabienne Fritzinger | Erhard Hinrichs | Marie Hinrichs | Thomas Zastrow
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a web service-based environment for the use of linguistic resources and tools to address issues of terminology and language varieties. We discuss the architecture, corpus representation formats, components and a chainer supporting the combination of tools into task-specific services. Integrated into this environment, single web services also become part of complex scenarios for web service use. Our web services take for example corpora of several million words as an input on which they perform preprocessing, such as tokenisation, tagging, lemmatisation and parsing, and corpus exploration, such as collocation extraction and corpus comparison. Here we present an example on extraction of single and multiword items typical of a specific domain or typical of a regional variety of German. We also give a critical review on needs and available functions from a user's point of view. The work presented here is part of ongoing experimentation in the D-SPIN project, the German national counterpart of CLARIN.

pdf bib
WebLicht: Web-Based LRT Services for German
Erhard Hinrichs | Marie Hinrichs | Thomas Zastrow
Proceedings of the ACL 2010 System Demonstrations