Abstract
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future research interface. Due to the recent acquisition of new licenses, DeReKo has grown by a factor of four in the first half of 2014, mostly in the area of newspaper text, and presently contains over 24 billion word tokens. Other strata, like fictional texts, web corpora, in particular CMC texts, and spoken but conceptually written texts have also increased significantly. We report on the newly acquired corpora that led to the major increase, on the principles and strategies behind our corpus acquisition activities, and on our solutions for the emerging legal, organisational, and technical challenges.
- Anthology ID:
- L14-1648
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2378–2385
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/842_Paper.pdf
- Cite (ACL):
- Marc Kupietz and Harald Lüngen. 2014. Recent Developments in DeReKo. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2378–2385, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Recent Developments in DeReKo (Kupietz & Lüngen, LREC 2014)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/842_Paper.pdf