Abstract
Clustering unlinkable entity mentions across documents in multiple languages (cross-lingual NIL Clustering) is an important part of Entity Discovery and Linking (EDL). The task has been largely neglected by the EDL community because simple edit-distance and other heuristic baselines are difficult to outperform. We propose a novel approach that encodes the orthographic similarity of mentions with a Recurrent Neural Network (RNN) architecture. To achieve this, our model adapts a training procedure from the one-shot facial recognition literature. We also perform several exploratory probing tasks on our name encodings to determine what specific types of information our model is likely to encode. Experiments show our approach provides up to a 6.6% absolute CEAFm F-Score improvement over state-of-the-art methods and successfully captures phonological relations across languages.
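As a rough illustration of the kind of model the abstract describes (a character-level RNN name encoder trained with an objective borrowed from one-shot facial recognition), here is a minimal PyTorch sketch. The class names, dimensions, margin, and the specific triplet loss are assumptions for illustration only and are not the authors' implementation.

```python
# Hypothetical sketch (not the paper's released code): a character-level GRU
# encoder for entity mention strings, trained with a triplet objective in the
# spirit of one-shot facial recognition. Embeddings can then be clustered by
# cosine similarity to group NIL mentions across languages.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharNameEncoder(nn.Module):
    """Encode a mention string into a fixed-size embedding via a char-level GRU."""

    def __init__(self, vocab_size: int, char_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, max_len) integer character indices, 0 = padding
        x = self.embed(char_ids)
        _, h = self.gru(x)                      # h: (2, batch, hidden_dim)
        v = torch.cat([h[0], h[1]], dim=-1)     # concatenate both directions
        return F.normalize(v, dim=-1)           # unit-length name embeddings


def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """Triplet loss: pull same-entity name embeddings together, push others apart."""
    pos_dist = 1.0 - (anchor * positive).sum(dim=-1)   # cosine distance
    neg_dist = 1.0 - (anchor * negative).sum(dim=-1)
    return F.relu(pos_dist - neg_dist + margin).mean()


if __name__ == "__main__":
    # Toy usage: random "character id" tensors stand in for transliterated mentions.
    encoder = CharNameEncoder(vocab_size=100)
    anchor = encoder(torch.randint(1, 100, (8, 12)))
    positive = encoder(torch.randint(1, 100, (8, 12)))
    negative = encoder(torch.randint(1, 100, (8, 12)))
    loss = triplet_loss(anchor, positive, negative)
    loss.backward()
    print("triplet loss:", loss.item())
```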
- Anthology ID:
- W19-2804
- Volume:
- Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference
- Month:
- June
- Year:
- 2019
- Address:
- Minneapolis, USA
- Editors:
- Maciej Ogrodniczuk, Sameer Pradhan, Yulia Grishina, Vincent Ng
- Venue:
- CRAC
- Publisher:
- Association for Computational Linguistics
- Pages:
- 20–25
- URL:
- https://aclanthology.org/W19-2804
- DOI:
- 10.18653/v1/W19-2804
- Cite (ACL):
- Kevin Blissett and Heng Ji. 2019. Cross-lingual NIL Entity Clustering for Low-resource Languages. In Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference, pages 20–25, Minneapolis, USA. Association for Computational Linguistics.
- Cite (Informal):
- Cross-lingual NIL Entity Clustering for Low-resource Languages (Blissett & Ji, CRAC 2019)
- PDF:
- https://preview.aclanthology.org/autopr/W19-2804.pdf