GATEtoGerManC: A GATE-based Annotation Pipeline for Historical German
Silke Scheible, Richard J. Whitt, Martin Durrell, Paul Bennett
Abstract
We describe a new GATE-based linguistic annotation pipeline for Early Modern German, which can be used to annotate historical texts with word tokens, sentence boundaries, lemmas, and POS tags. The pipeline is based on a customisation of the freely available ANNIE system for English (Cunningham et al., 2002), in combination with a version of the TreeTagger (Schmid, 1994) trained on gold standard Early Modern German data. The POS-tagging and lemmatisation components of the pipeline achieve an average accuracy of 89.44% and 83.16%, respectively, on unseen historical data from various genres and publication dates within the Early Modern period. We show that normalisation of spelling variation can further improve these results. With no specialised tools available for processing this particular stage of the language, this pipeline will be of particular interest to smaller, humanities-based projects wishing to add linguistic annotations to their historical data but which lack the means or resources to develop such tools themselves.
- Anthology ID: L12-1584
- Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month: May
- Year: 2012
- Address: Istanbul, Turkey
- Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association (ELRA)
- Pages: 3611–3617
- URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/978_Paper.pdf
- Cite (ACL): Silke Scheible, Richard J. Whitt, Martin Durrell, and Paul Bennett. 2012. GATEtoGerManC: A GATE-based Annotation Pipeline for Historical German. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3611–3617, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal): GATEtoGerManC: A GATE-based Annotation Pipeline for Historical German (Scheible et al., LREC 2012)