A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters

Grzegorz Chrupała, Dietrich Klakow


Abstract
Named Entity Recognition is a relatively well-understood NLP task, with many publicly available training resources and software for processing English data. Other languages tend to be underserved in this area. For German, the CoNLL-2003 Shared Task provided training data, but there are no publicly available, ready-to-use tools. We fill this gap and develop a German NER system with state-of-the-art performance. In addition to the CoNLL-2003 labeled training data, we use two further resources: (i) 32 million words of unlabeled news article text and (ii) infobox labels from German Wikipedia articles. From the unlabeled text we derive distributional word clusters. We then use cluster membership features and Wikipedia infobox label features to train a supervised model on the labeled training data. This approach allows us to deal better with word types unseen in the training data and to achieve good performance on German with little engineering effort.
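The feature scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `token_features`, the feature-string format, the cluster IDs, and the infobox label strings are all hypothetical, and the sketch assumes a simple single-token lookup into both resources, which the abstract does not specify.

```python
from typing import Dict, List

# Toy resources, invented for illustration only.
# CLUSTERS stands in for word clusters derived from unlabeled news text;
# INFOBOX stands in for labels harvested from Wikipedia infoboxes.
CLUSTERS: Dict[str, str] = {"Berlin": "c17", "München": "c17", "wohnt": "c4"}
INFOBOX: Dict[str, str] = {"Berlin": "ort"}


def token_features(
    tokens: List[str],
    i: int,
    clusters: Dict[str, str],
    infobox: Dict[str, str],
) -> Dict[str, float]:
    """Build a sparse feature dict for the token at position i."""
    word = tokens[i]
    feats = {"word=" + word: 1.0}

    # Cluster membership: can fire for a word that never appears in the
    # labeled data, as long as it occurred in the unlabeled text.
    cluster = clusters.get(word)
    if cluster is not None:
        feats["cluster=" + cluster] = 1.0

    # Infobox label: fires only when the token matches a known entry
    # (assumed here to be an exact single-token match).
    label = infobox.get(word)
    if label is not None:
        feats["infobox=" + label] = 1.0

    return feats


sentence = ["Er", "wohnt", "in", "München"]
features = token_features(sentence, 3, CLUSTERS, INFOBOX)
```

In this toy setup "München" has no infobox entry but shares the cluster `c17` with "Berlin", so a model that learned a weight for `cluster=c17` from labeled occurrences of "Berlin" can still use it for "München". Feature dicts of this kind could be passed to any supervised sequence labeler.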
Anthology ID:
L10-1371
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/538_Paper.pdf
Cite (ACL):
Grzegorz Chrupała and Dietrich Klakow. 2010. A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters (Chrupała & Klakow, LREC 2010)
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/538_Paper.pdf