2024
EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research
Marc Kupietz
|
Piotr Bański
|
Nils Diewald
|
Beata Trawiński
|
Andreas Witt
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024
2022
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Piotr Bański
|
Adrien Barbaresi
|
Simon Clematide
|
Marc Kupietz
|
Harald Lüngen
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Count-Based and Predictive Language Models for Exploring DeReKo
Peter Fankhauser
|
Marc Kupietz
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skip-gram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis we visualize the high-dimensional word embeddings in two dimensions using t-distributed stochastic neighbour embedding (t-SNE). Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models w.r.t. scalability and maintainability in very large corpora.
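As a rough illustration of what a count-based association measure looks like, the sketch below computes pointwise mutual information (PMI) from windowed co-occurrence counts. The abstract does not say which measures the authors use, so the choice of PMI, the function name `pmi_scores`, and the `window` parameter are assumptions made here for illustration only.

```python
import math
from collections import Counter


def pmi_scores(tokens, window=2):
    """Return PMI for every ordered word pair co-occurring within `window` tokens.

    Hypothetical helper for illustration; not taken from the paper.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    # Count each (left word, right word) pair inside the forward-looking window.
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair_counts[(left, tokens[j])] += 1

    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI compares the observed pair probability with the one expected
        # if the two words occurred independently of each other.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    toy_corpus = "guten morgen guten morgen guten abend".split()
    for pair, score in sorted(pmi_scores(toy_corpus, window=1).items()):
        print(pair, round(score, 3))
```

On a corpus of realistic size the counting step would have to be distributed or streamed, which is where the scalability questions raised in the abstract come in.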
2021
Data-driven Identification of Idioms in Song Lyrics
Miriam Amin
|
Peter Fankhauser
|
Marc Kupietz
|
Roman Schneider
Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)
The automatic recognition of idioms poses a challenging problem for NLP applications. Whereas native speakers can intuitively handle multiword expressions whose compositional meanings are hard to trace back to individual word semantics, there is still ample scope for improvement regarding computational approaches. We assume that idiomatic constructions can be characterized by gradual intensities of semantic non-compositionality, formal fixedness, and unusual usage context, and introduce a number of measures for these characteristics, comprising count-based and predictive collocation measures together with measures of context (dis)similarity. We evaluate our approach on a manually labelled gold standard, derived from a corpus of German pop lyrics. To this end, we apply a Random Forest classifier to analyze the individual contribution of features for automatically detecting idioms, and study the trade-off between recall and precision. Finally, we evaluate the classifier on an independent dataset of idioms extracted from a list of Wikipedia idioms, achieving state-of-the-art accuracy.
2020
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Piotr Bański
|
Adrien Barbaresi
|
Simon Clematide
|
Marc Kupietz
|
Harald Lüngen
|
Ines Pisetta
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora
Denis Arnold
|
Bernhard Fisseni
|
Paweł Kamocki
|
Oliver Schonefeld
|
Marc Kupietz
|
Thomas Schmidt
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
This paper addresses long-term archiving of large corpora. It focuses on three aspects specific to language resources: (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. The paper explains why language resources may have to be changed and why formats may need to be converted. As a solution, it suggests the use of an intermediate proxy object called a signpost. The approach is exemplified with the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).
Evaluating a Dependency Parser on DeReKo
Peter Fankhauser
|
Bich-Ngoc Do
|
Marc Kupietz
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
We evaluate a graph-based dependency parser on DeReKo, a large corpus of contemporary German. The dependency parser is trained on the German dataset from the SPMRL 2014 Shared Task, which contains text from the news domain, whereas DeReKo also covers other domains including fiction, science, and technology. To avoid the need for costly manual annotation of the corpus, we use the parser’s probability estimates for unlabeled and labeled attachment as the main evaluation criterion. We show that these probability estimates are highly correlated with the actual attachment scores on a manually annotated test set. On this basis, we compare estimated parsing scores for the individual domains in DeReKo, and show that the scores decrease with increasing distance of a domain from the training corpus.
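The validation step described above (checking that the parser's probability estimates track actual attachment scores) can be sketched as a plain Pearson correlation. The abstract does not name the correlation coefficient used, so Pearson's r is an assumption here, and the numbers in the demo are invented placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations (the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Invented per-sentence values, for illustration only:
    # mean attachment probability reported by a parser vs. the
    # attachment score measured against a gold annotation.
    estimated = [0.97, 0.91, 0.85, 0.78, 0.66]
    measured = [0.95, 0.92, 0.80, 0.75, 0.60]
    print(round(pearson(estimated, measured), 3))
```

If the correlation holds on an annotated sample, the averaged estimates can serve as a proxy score for domains where no gold annotation exists, which is the use the abstract describes.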
RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP
Marc Kupietz
|
Nils Diewald
|
Eliza Margaretha
Proceedings of the Twelfth Language Resources and Evaluation Conference
Making corpora accessible and usable for linguistic research is a huge challenge in view of (too) big data, legal issues and a rapidly evolving methodology. This affects not only the design of user-friendly graphical interfaces to corpus analysis tools, but also the availability of programming interfaces supporting access to the functionality of these tools from various analysis and development environments. RKorAPClient is a new research tool in the form of an R package that interacts with the Web API of the corpus analysis platform KorAP, which provides access to large annotated corpora, including the German reference corpus DeReKo with 45 billion tokens. In addition to optionally authenticated KorAP API access, RKorAPClient provides further processing and visualization features to simplify common corpus analysis tasks. This paper introduces the basic functionality of RKorAPClient and exemplifies various analysis tasks based on DeReKo, which are bundled within the R package and can serve as a basic framework for advanced analysis and visualization approaches.
2018
The German Reference Corpus DeReKo: New Developments – New Opportunities
Marc Kupietz
|
Harald Lüngen
|
Paweł Kamocki
|
Andreas Witt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
KorAP Architecture ― Diving in the Deep Sea of Corpus Data
Nils Diewald
|
Michael Hanl
|
Eliza Margaretha
|
Joachim Bingel
|
Marc Kupietz
|
Piotr Bański
|
Andreas Witt
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
KorAP is a corpus search and analysis platform developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
2014
Access control by query rewriting: the case of KorAP
Piotr Bański
|
Nils Diewald
|
Michael Hanl
|
Marc Kupietz
|
Andreas Witt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present an approach to one aspect of managing complex access scenarios for large and heterogeneous corpora: handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user's permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.
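The general idea of rewriting a query to enforce access restrictions can be sketched as follows. This is a minimal illustration under stated assumptions, not KorAP's actual protocol or data model: the dictionary layout, the `corpora` and `rewrites` keys, and the function name `rewrite_for_access` are all hypothetical.

```python
import copy


def rewrite_for_access(query, allowed_corpora):
    """Return a copy of `query` restricted to corpora the user may access.

    Hypothetical sketch: the query is a plain dict with an optional
    "corpora" list; an empty or missing list means "search everything".
    """
    rewritten = copy.deepcopy(query)  # never mutate the caller's query
    requested = set(query.get("corpora", []))
    allowed = set(allowed_corpora)

    # No explicit request: fall back to everything the user may see.
    permitted = (requested & allowed) if requested else allowed
    denied = requested - allowed

    rewritten["corpora"] = sorted(permitted)
    # Record what was changed so the user can be told about the rewrite.
    rewritten["rewrites"] = (
        [{"operation": "restrict", "removed": sorted(denied)}] if denied else []
    )
    return rewritten


if __name__ == "__main__":
    user_query = {"q": "Haus", "corpora": ["news", "fiction"]}
    print(rewrite_for_access(user_query, allowed_corpora={"news", "wiki"}))
```

Recording the rewrite alongside the restricted query, rather than silently dropping the denied parts, keeps the result set explainable to the user; whether and how KorAP reports rewrites is described in the paper itself.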
Recent Developments in DeReKo
Marc Kupietz
|
Harald Lüngen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future research interface. Due to the recent acquisition of new licenses, DeReKo has grown by a factor of four in the first half of 2014, mostly in the area of newspaper text, and presently contains over 24 billion word tokens. Other strata, like fictional texts, web corpora, in particular CMC texts, and spoken but conceptually written texts have also increased significantly. We report on the newly acquired corpora that led to the major increase, on the principles and strategies behind our corpus acquisition activities, and on our solutions for the emerging legal, organisational, and technical challenges.
2012
The New IDS Corpus Analysis Platform: Challenges and Prospects
Piotr Bański
|
Peter M. Fischer
|
Elena Frick
|
Erik Ketzan
|
Marc Kupietz
|
Carsten Schnober
|
Oliver Schonefeld
|
Andreas Witt
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that face it. Next, we outline our software implementation plan and describe development to date.
2010
The German Reference Corpus DeReKo: A Primordial Sample for Linguistic Research
Marc Kupietz
|
Cyril Belica
|
Holger Keibel
|
Andreas Witt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes DeReKo (Deutsches Referenzkorpus), the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS) in Mannheim, and the rationale behind its development. We discuss its design, its legal background, how to access it, available metadata, linguistic annotation layers, underlying standards, ongoing developments, and aspects of using the archive for empirical linguistic research. The focus of the paper is on the advantages of DeReKo's design as a primordial sample from which virtual corpora can be drawn for the specific purposes of individual studies. Both concepts, primordial sample and virtual corpus, are explained and illustrated in detail. Furthermore, we describe in more detail how DeReKo deals with the fact that all its texts are subject to third parties' intellectual property rights, and how it deals with the issue of replicability, which is particularly challenging given DeReKo's dynamic growth and the possibility of constructing from it an open number of virtual corpora.