Identifying Named Entities in Text Databases from the Natural History Domain

Caroline Sporleder, Marieke van Erp, Tijn Porcelijn, Antal van den Bosch, Pim Arntzen


Abstract
In this paper, we investigate whether it is possible to bootstrap a named entity tagger for textual databases by exploiting the database structure to automatically generate domain- and database-specific gazetteer lists. We compare three tagging strategies: (i) using the extracted gazetteers in a look-up tagger, (ii) using the gazetteers to automatically extract training data to train a database-specific tagger, and (iii) using a generic named entity tagger. Our results suggest that automatically built gazetteers in combination with a look-up tagger lead to relatively good performance and that generic taggers do not perform particularly well on this type of data.
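Strategy (i) can be illustrated with a minimal sketch. It is not the authors' implementation: the column names, entity labels, sample record, and the greedy longest-match rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def build_gazetteers(records, column_labels):
    """Collect the values of selected database columns as gazetteer entries.

    column_labels maps a column name to an entity label; both are
    hypothetical here and would depend on the actual database schema.
    """
    gazetteers = defaultdict(set)
    for record in records:
        for column, label in column_labels.items():
            value = (record.get(column) or "").strip()
            if value:
                # Store each entry as a tuple of lower-cased tokens.
                gazetteers[label].add(tuple(value.lower().split()))
    return gazetteers


def lookup_tag(tokens, gazetteers):
    """Tag tokens by greedy longest-match look-up (an assumed matching rule)."""
    entries = {entry: label
               for label, items in gazetteers.items()
               for entry in items}
    max_len = max((len(entry) for entry in entries), default=0)
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = entries.get(tuple(lowered[i:i + n]))
            if label:
                tags[i:i + n] = [label] * n
                i += n
                break
        else:
            i += 1
    return tags


# Hypothetical record and schema, invented for this example.
records = [{"collector": "J. Smith", "location": "Suriname"}]
column_labels = {"collector": "PERSON", "location": "LOCATION"}
gazetteers = build_gazetteers(records, column_labels)

tokens = "Collected by J. Smith in Suriname".split()
tags = lookup_tag(tokens, gazetteers)
```

Here `tags` marks "J." and "Smith" as PERSON and "Suriname" as LOCATION, leaving the other tokens as "O". Under the same assumptions, free-text fields tagged this way could serve as the automatically extracted training data mentioned in strategy (ii).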
Anthology ID:
L06-1288
Volume:
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month:
May
Year:
2006
Address:
Genoa, Italy
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/482_pdf.pdf
Cite (ACL):
Caroline Sporleder, Marieke van Erp, Tijn Porcelijn, Antal van den Bosch, and Pim Arntzen. 2006. Identifying Named Entities in Text Databases from the Natural History Domain. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal):
Identifying Named Entities in Text Databases from the Natural History Domain (Sporleder et al., LREC 2006)