Korp — the corpus infrastructure of Språkbanken

Lars Borin, Markus Forsberg, Johan Roxendal


Abstract
We present Korp, the corpus infrastructure of Språkbanken (the Swedish Language Bank). The infrastructure consists of three main components: the Korp corpus pipeline, the Korp backend, and the Korp frontend. The Korp corpus pipeline is used for importing corpora, annotating them, and then exporting the annotated corpora into different formats. An essential feature of the pipeline is its ability to leave existing annotations untouched, both structural and word-level, and to use them as the foundation for further annotations. The Korp backend consists of a set of REST-based web services for searching in and retrieving information about the corpora. Finally, the Korp frontend is a graphical search interface that interacts with the Korp backend. The interface is inspired by corpus search interfaces such as SketchEngine, Glossa, and DeepDict, and it uses State Chart XML (SCXML) to let users bookmark interaction states. We give a functional and technical overview of the three components, followed by a discussion of planned future work.
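The pipeline principle described in the abstract (leave existing annotations untouched and build new ones on top of them) can be illustrated with a minimal sketch. This is not Korp's actual code: the token representation, the `annotate` helper, and the toy annotators `toy_pos` and `toy_lemma` are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical token representation: annotation name -> value.
Token = Dict[str, str]
Annotator = Callable[[List[Token]], List[str]]


def annotate(tokens: List[Token], name: str, annotator: Annotator) -> List[Token]:
    """Return new tokens with annotation `name` added.

    A token that already carries `name` keeps its existing value;
    the annotator's proposal is used only where the annotation is missing.
    """
    proposed = annotator(tokens)
    result = []
    for token, value in zip(tokens, proposed):
        updated = dict(token)            # copy, so the input stays unchanged
        updated.setdefault(name, value)  # existing annotation wins
        result.append(updated)
    return result


def toy_pos(tokens: List[Token]) -> List[str]:
    """Toy tagger (assumption): capitalized words are tagged NOUN, others X."""
    return ["NOUN" if t["word"][0].isupper() else "X" for t in tokens]


def toy_lemma(tokens: List[Token]) -> List[str]:
    """Toy lemmatizer (assumption) that builds on the `pos` annotation."""
    return [
        t["word"] if t.get("pos") == "NOUN" else t["word"].lower()
        for t in tokens
    ]


# The first token arrives with a pos annotation already in place.
corpus = [{"word": "Korp", "pos": "PROPN"}, {"word": "Search"}]
tagged = annotate(corpus, "pos", toy_pos)
lemmatized = annotate(tagged, "lemma", toy_lemma)
```

In this sketch the pre-existing `pos` value on the first token survives the tagging step, and the lemmatizer then reads whichever `pos` value each token ended up with, whether it was imported or newly assigned.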
Anthology ID:
L12-1098
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
474–478
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/248_Paper.pdf
Cite (ACL):
Lars Borin, Markus Forsberg, and Johan Roxendal. 2012. Korp — the corpus infrastructure of Språkbanken. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 474–478, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Korp — the corpus infrastructure of Språkbanken (Borin et al., LREC 2012)