Igor Sominsky
2010
The ConceptMapper Approach to Named Entity Recognition
Michael Tanenblatt
|
Anni Coden
|
Igor Sominsky
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
ConceptMapper is an open source tool we created for classifying mentions in an unstructured text document based on concept terminologies (dictionaries) and yielding named entities as output. It is implemented as a UIMA (Unstructured Information Management Architecture) annotator and is highly configurable: concepts can come from standardised or proprietary terminologies; arbitrary attributes can be associated with dictionary entries, and those attributes can then be associated with the named entities in the output; numerous search strategies and search options can be specified; any tokenizer packaged as a UIMA annotator can be used to tokenize the dictionary, so the same tokenization can be guaranteed for the input and dictionary, minimising tokenization mismatch errors; and the types and features of UIMA annotations used as input and generated as output can also be controlled. We describe ConceptMapper and its configuration parameters and their trade-offs, then describe the results of an experiment wherein some of these parameters are varied and precision and recall are subsequently measured in the task of in identifying concepts in a collection English-language clinical reports (colon cancer pathology). ConceptMapper is available from the Apache UIMA Sandbox, covered by the Apache Open Source license.