Abstract
Text comparison is an interesting but hard task with many applications in Natural Language Processing. This work introduces a new text similarity measure that combines named-entity information extracted from the texts with the n-gram graph model for representing documents. Using OpenCalais as the named-entity recognition service and the JINSECT toolkit for constructing and managing n-gram graphs, the text similarity measure is embedded in a text clustering algorithm (k-Means). The evaluation of the produced clusters with various clustering validity metrics shows that extracting named entities as a first step can improve the time performance of similarity measures based on the n-gram graph representation without affecting the overall performance of the NLP task.
- Anthology ID:
- R17-1098
- Volume:
- Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
- Month:
- September
- Year:
- 2017
- Address:
- Varna, Bulgaria
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 765–771
- URL:
- https://doi.org/10.26615/978-954-452-049-6_098
- DOI:
- 10.26615/978-954-452-049-6_098
- Cite (ACL):
- Leonidas Tsekouras, Iraklis Varlamis, and George Giannakopoulos. 2017. A Graph-based Text Similarity Measure That Employs Named Entity Information. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 765–771, Varna, Bulgaria. INCOMA Ltd.
- Cite (Informal):
- A Graph-based Text Similarity Measure That Employs Named Entity Information (Tsekouras et al., RANLP 2017)
- PDF:
- https://doi.org/10.26615/978-954-452-049-6_098
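
The idea summarized in the abstract, comparing documents through n-gram graphs built only from their named entities, can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the JINSECT or OpenCalais APIs: the function names (`ngram_graph`, `containment_similarity`, `entity_similarity`), the parameters `n` and `window`, the sorting of entities before joining them, and the simplified containment-style overlap score are all assumptions made for illustration. Entity lists are supplied by hand in place of a recognition service.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of undirected edges.

    Each edge links two n-grams that occur within `window` positions of
    each other; the count records how often they co-occur.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of the smaller graph's edges that also appear in the other graph.

    Simplified, illustrative score in [0, 1]; edge weights are ignored.
    """
    if not g1 or not g2:
        return 0.0
    shared = len(g1.keys() & g2.keys())
    return shared / min(len(g1), len(g2))


def entity_similarity(entities_a, entities_b, n=3, window=3):
    """Compare two documents using only their named entities (assumed given)."""
    # Sorting makes the score independent of the order in which entities
    # were extracted; this is a choice of this sketch, not of the paper.
    text_a = " ".join(sorted(entities_a))
    text_b = " ".join(sorted(entities_b))
    return containment_similarity(
        ngram_graph(text_a, n, window), ngram_graph(text_b, n, window)
    )


doc_a = ["Ruslan Mitkov", "Varna", "Bulgaria"]
doc_b = ["Galia Angelova", "Varna", "Bulgaria"]
print(entity_similarity(doc_a, doc_b))
```

Because the graphs are built from a short entity string rather than the full text, they contain far fewer edges, which is the intuition behind the time savings the abstract reports. A pairwise score of this kind could then be used as the similarity function inside a clustering routine such as k-Means.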