Clustering of Multi-Word Named Entity variants: Multilingual Evaluation

Guillaume Jacquet, Maud Ehrmann, Ralf Steinberger


Abstract
Multi-word entities, such as organisation names, are frequently written in many different ways. We have previously automatically identified over one million acronym pairs in 22 languages, consisting of their short form (e.g. EC) and their corresponding long forms (e.g. European Commission, European Union Commission). In order to automatically group such long form variants as belonging to the same entity, we cluster them, using bottom-up hierarchical clustering and pair-wise string similarity metrics. In this paper, we address the issue of how to evaluate the named entity variant clusters automatically, with minimal human annotation effort. We present experiments that make use of Wikipedia redirection tables and we show that this method produces good results.
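As a rough illustration of the approach described in the abstract, the sketch below clusters long-form variants bottom-up using a pair-wise string similarity, then scores the clusters against a redirect table. It is not the authors' implementation. The similarity metric (Python's `difflib.SequenceMatcher`), the average-linkage rule, the 0.7 threshold, and the names `cluster_variants`, `pairwise_precision` and `redirect_target` are all assumptions made for this example. The redirect table is a small hand-written dictionary standing in for real Wikipedia redirection data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in metric, assumed)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_variants(variants, threshold=0.7):
    """Bottom-up clustering with average linkage (threshold is an assumption).

    Starts with one cluster per variant and repeatedly merges the two most
    similar clusters until no pair of clusters reaches the threshold.
    """
    clusters = [[v] for v in variants]

    def linkage(c1, c2):
        # Average pair-wise similarity between the members of two clusters.
        scores = [similarity(a, b) for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # combinations() guarantees i < j, so deleting j keeps i valid.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def pairwise_precision(clusters, redirect_target):
    """Share of same-cluster pairs whose variants map to the same target.

    redirect_target is a hypothetical dict: variant -> canonical page title.
    Pairs with a variant missing from the table are skipped; returns None
    when no pair can be checked.
    """
    agree = total = 0
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            if a in redirect_target and b in redirect_target:
                total += 1
                agree += redirect_target[a] == redirect_target[b]
    return agree / total if total else None


variants = [
    "European Commission",
    "European Union Commission",
    "World Health Organization",
    "World Health Organisation",
]
# Toy stand-in for a Wikipedia redirection table (hand-written, not real data).
redirects = {
    "European Commission": "European Commission",
    "European Union Commission": "European Commission",
    "World Health Organization": "World Health Organization",
    "World Health Organisation": "World Health Organization",
}

clusters = cluster_variants(variants)
precision = pairwise_precision(clusters, redirects)
```

A redirect table is a convenient reference here because it already records which surface forms editors consider to be names of the same page, so cluster quality can be checked without annotating each pair by hand.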
Anthology ID:
L14-1396
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2548–2553
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/468_Paper.pdf
Cite (ACL):
Guillaume Jacquet, Maud Ehrmann, and Ralf Steinberger. 2014. Clustering of Multi-Word Named Entity variants: Multilingual Evaluation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2548–2553, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Clustering of Multi-Word Named Entity variants: Multilingual Evaluation (Jacquet et al., LREC 2014)