Timm Lehmberg
2026
Text+: A National Hub Including Legacy Language Data
Florian Barth | Christoph Draxler | Jennifer Ecker | Stefan Fischer | Philippe Genêt | Alina Hemmer | Timm Lehmberg | Thorsten Trippel | Andreas Witt | Arden Zimmermann | Claus Zinn
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Text+ is the German distributed research data infrastructure for literary studies, linguistics, and spoken and written language. Its resources consist of contemporary and historical literary and media texts, deeply annotated material, transcripts of spoken and sign language, and original recordings. Text+ provides access to its resources according to the FAIR guidelines: Findable due to standard-conformant metadata, Accessible with single sign-on authentication, Interoperable via open data formats, and Reusable through web services and extensive documentation. The 30+ partners of Text+ are archives, libraries, universities, and other research institutions. The partners are autonomous, and they differ in the amount of data and processing capabilities they provide. In this paper, we describe the hub architecture of Text+, which gives users a central and FAIR point of access to research data that continues to be distributed across the Text+ partner institutions. The architecture serves as a blueprint for evolving research infrastructures that aim at maintaining (and empowering) their research data contributors.
2022
Bringing Together Version Control and Quality Assurance of Language Data with LAMA
Aleksandr Riaposov | Elena Lazarenko | Timm Lehmberg
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
This contribution reports on work in progress on project-specific software and digital infrastructure components used along with corpus curation workflows in the framework of the long-term language documentation project INEL. By bringing together scientists with different levels of technical affinity in a highly interdisciplinary working environment, the project is confronted with numerous workflow-related issues. Many of them result from collaborative (remote) work on digital corpora, which includes, among other things, annotation and glossing but also quality and consistency control. In this context, several steps were taken to bridge the gap between usability and the requirements of complex data curation workflows. Components of the latter, such as a versioning system and semi-automated data validators, on the one side meet the user demands for simplicity and minimalism on the other side. By embedding a simple shell script in an interactive graphical user interface, we augment the efficacy of the data versioning and the integration of Java-based quality control and validation tools.
2020
Towards Flexible Cross-Resource Exploitation of Heterogeneous Language Documentation Data
Daniel Jettka | Timm Lehmberg
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper reports on challenges and solution approaches in the development of methods for data analysis across language resources in the field of language documentation. It is based on the successful outcomes of the initial phase of an 18-year long-term project on lesser-resourced and mostly endangered indigenous languages of the Northern Eurasian area, which included the finalization and publication of multiple language corpora and additional language resources. While aiming at comprehensive cross-resource data analysis, the project is at the same time confronted with a dynamic and complex resource landscape, resulting especially from a vast amount of multi-layered information stored in the form of analogue primary data in different, widely dispersed archives on the territory of the Russian Federation. The methods described aim at resolving the tension between unification of data sets and vocabularies on the one hand and maximum openness for the integration of future resources and adaptation of external information on the other hand.
2018
Introducing the CLARIN Knowledge Centre for Linguistic Diversity and Language Documentation
Hanna Hedeland | Timm Lehmberg | Felix Rau | Sophie Salffner | Mandana Seyfeddinipur | Andreas Witt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2008
The Metadata-Database of a Next Generation Sustainability Web-Platform for Language Resources
Georg Rehm | Oliver Schonefeld | Andreas Witt | Timm Lehmberg | Christian Chiarcos | Hanan Bechara | Florian Eishold | Kilian Evang | Magdalena Leshtanska | Aleksandar Savkov | Matthias Stark
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.
Co-authors
- Andreas Witt 3
- Florian Barth 1
- Hanan Bechara 1
- Christian Chiarcos 1
- Christoph Draxler 1
- Jennifer Ecker 1
- Florian Eishold 1
- Kilian Evang 1
- Stefan Fischer 1
- Philippe Genêt 1
- Hanna Hedeland 1
- Alina Hemmer 1
- Daniel Jettka 1
- Elena Lazarenko 1
- Magdalena Leshtanska 1
- Felix Rau 1
- Georg Rehm 1
- Aleksandr Riaposov 1
- Sophie Salffner 1
- Aleksandar Savkov 1
- Oliver Schonefeld 1
- Mandana Seyfeddinipur 1
- Matthias Stark 1
- Thorsten Trippel 1
- Arden Zimmermann 1
- Claus Zinn 1