Indexing Methods for Faster and More Effective Person Name Search

Mark Arehart


Abstract
This paper compares several indexing methods for person names extracted from text. The methods were developed for an information retrieval system that requires fast approximate matching of noisy, multicultural Romanized names. Approximate matching algorithms of this kind are computationally expensive and unacceptably slow when used without an indexing or blocking step. The goal of indexing is to create a small candidate pool that contains all the true matches and can be searched exhaustively by a more effective but slower name comparison method. In addition to dramatically faster search, some of the methods evaluated here led to modest gains in effectiveness by eliminating false positives. Four indexing techniques, using either phonetic keys or substrings of name segments, with and without name segment stopword lists, were combined with three name matching algorithms. On a test set of 700 queries run against 70K noisy and multicultural names, the best-performing technique took just 2.1% as long as a naive exhaustive search and increased F1 by 3 points, showing that an appropriate indexing technique can increase both speed and effectiveness.
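The blocking idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the trigram key scheme, the `STOP_SEGMENTS` list, the 0.8 threshold, the sample names, and the use of Python's `difflib.SequenceMatcher` as the slower comparison method are all assumptions made for illustration. The index maps substrings of name segments to name ids, so only names sharing at least one key with the query are passed to the expensive comparison.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical stopword list of very common name segments (an assumption,
# not taken from the paper). Stopped segments produce no index keys.
STOP_SEGMENTS = {"al", "bin", "de", "van"}


def segment_keys(name, n=3):
    """Return the character n-grams of each non-stopword name segment."""
    keys = set()
    for seg in name.lower().split():
        if seg in STOP_SEGMENTS:
            continue
        if len(seg) <= n:
            keys.add(seg)  # short segments are indexed whole
        else:
            keys.update(seg[i:i + n] for i in range(len(seg) - n + 1))
    return keys


def build_index(names, n=3):
    """Map each key to the set of positions of names that contain it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for key in segment_keys(name, n):
            index[key].add(i)
    return index


def search(query, names, index, threshold=0.8, n=3):
    """Build a candidate pool from the index, then score only the pool.

    Returns (matches, pool_size), where matches is a list of
    (score, name) pairs at or above the threshold, best first.
    """
    pool = set()
    for key in segment_keys(query, n):
        pool |= index.get(key, set())
    scored = []
    for i in pool:
        # Stand-in for the slower, more effective name comparison method.
        score = SequenceMatcher(None, query.lower(), names[i].lower()).ratio()
        if score >= threshold:
            scored.append((score, names[i]))
    return sorted(scored, reverse=True), len(pool)


# Illustrative data only.
NAMES = [
    "Mohammed al Rashid",
    "Muhammad al-Rashid",
    "John Smith",
    "Jon Smyth",
    "Maria de la Cruz",
]
INDEX = build_index(NAMES)
matches, pool_size = search("Mohamed al Rashid", NAMES, INDEX)
```

In this sketch the query is compared against only the names in the pool rather than the full list, which is where the speedup of a blocking step comes from. A pool that is too narrow can miss true matches, so the choice of key scheme trades speed against recall.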
Anthology ID:
L10-1107
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/166_Paper.pdf
Cite (ACL):
Mark Arehart. 2010. Indexing Methods for Faster and More Effective Person Name Search. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Indexing Methods for Faster and More Effective Person Name Search (Arehart, LREC 2010)