Giannis Nikolentzos
2017
Shortest-Path Graph Kernels for Document Similarity
Giannis Nikolentzos | Polykarpos Meladianos | François Rousseau | Yannis Stavrakas | Michalis Vazirgiannis
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
In this paper, we present a novel document similarity measure based on the definition of a graph kernel between pairs of documents. The proposed measure takes into account both the terms contained in the documents and the relationships between them. By representing each document as a graph-of-words, we are able to model these relationships and then determine how similar two documents are by using a modified shortest-path graph kernel. We evaluate our approach on two tasks and compare it against several baseline approaches using various performance metrics such as DET curves and macro-averaged F1-score. Experimental results on a range of datasets show that our proposed approach outperforms traditional techniques and measures the similarity between two documents more accurately.
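The idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size, the depth limit, the 1/hops path weighting, and the cosine-style normalization are all assumptions chosen for illustration, and the function names (`graph_of_words`, `shortest_paths`, `sp_similarity`) are hypothetical. Each document becomes a graph whose nodes are terms and whose edges link terms that co-occur within a sliding window; two documents are then compared through the terms they share and the term pairs that are connected by short paths in both graphs.

```python
from collections import deque
from math import sqrt


def graph_of_words(tokens, window=2):
    """Build an undirected graph: one node per unique term, one edge per
    pair of distinct terms that co-occur inside a sliding window
    (assumed here to span `window` consecutive tokens)."""
    adj = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if u != t:
                adj[t].add(u)
                adj[u].add(t)
    return adj


def shortest_paths(adj, max_depth):
    """Breadth-first search from every node. Returns {(u, v): hops} for
    u < v, keeping only paths of at most max_depth hops."""
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if seen[node] == max_depth:
                continue  # do not expand beyond the depth limit
            for nb in adj[node]:
                if nb not in seen:
                    seen[nb] = seen[node] + 1
                    queue.append(nb)
        for dst, d in seen.items():
            if src < dst:
                dist[(src, dst)] = d
    return dist


def _kernel(nodes_a, paths_a, nodes_b, paths_b):
    """Unnormalized kernel: shared terms plus shared term pairs, each pair
    weighted by the product of inverse path lengths (an assumed weighting)."""
    shared_terms = len(nodes_a & nodes_b)
    shared_paths = sum(1.0 / (paths_a[p] * paths_b[p])
                       for p in paths_a.keys() & paths_b.keys())
    return shared_terms + shared_paths


def sp_similarity(tokens_a, tokens_b, window=2, max_depth=2):
    """Normalized similarity between two tokenized documents."""
    ga, gb = graph_of_words(tokens_a, window), graph_of_words(tokens_b, window)
    pa, pb = shortest_paths(ga, max_depth), shortest_paths(gb, max_depth)
    na, nb = set(ga), set(gb)
    k_ab = _kernel(na, pa, nb, pb)
    k_aa = _kernel(na, pa, na, pa)
    k_bb = _kernel(nb, pb, nb, pb)
    return k_ab / sqrt(k_aa * k_bb) if k_aa and k_bb else 0.0


print(sp_similarity("the cat sat".split(), "the cat ran".split()))
```

Because the unnormalized value is an inner product of explicit feature vectors (term indicators plus inverse path lengths), the normalized score is 1 for identical documents and 0 for documents with no terms in common.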
Multivariate Gaussian Document Representation from Word Embeddings for Text Categorization
Giannis Nikolentzos | Polykarpos Meladianos | François Rousseau | Yannis Stavrakas | Michalis Vazirgiannis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
Recently, there has been a lot of activity in learning distributed representations of words in vector spaces. Although there are models capable of learning high-quality distributed representations of words, generating vector representations of the same quality for phrases or documents remains a challenge. In this paper, we propose to model each document as a multivariate Gaussian distribution based on the distributed representations of its words. We then measure the similarity between two documents based on the similarity of their distributions. Experiments on eight standard text categorization datasets demonstrate the effectiveness of the proposed approach in comparison with state-of-the-art methods.
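A minimal sketch of the idea in this abstract follows. It is not the authors' implementation: the maximum-likelihood covariance estimate, the mixing weight `alpha`, and the choice to compare means and covariances by cosine similarity are assumptions made for illustration, and the function names and toy 2-dimensional word vectors are hypothetical. A document is summarized by the mean and covariance of its word vectors, and two documents are compared through those two statistics.

```python
import math


def fit_gaussian(vectors):
    """Return (mean, covariance) of a list of equal-length word vectors.
    The covariance divides by n (maximum-likelihood estimate)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / n
            for b in range(dim)]
           for a in range(dim)]
    return mean, cov


def cosine(x, y):
    """Cosine similarity of two flat sequences; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def gaussian_similarity(doc_a, doc_b, alpha=0.5):
    """Convex combination of mean similarity and covariance similarity.
    alpha is an assumed mixing weight, not a value from the paper."""
    mean_a, cov_a = fit_gaussian(doc_a)
    mean_b, cov_b = fit_gaussian(doc_b)
    flat_a = [x for row in cov_a for x in row]  # flatten matrix to a vector
    flat_b = [x for row in cov_b for x in row]
    return alpha * cosine(mean_a, mean_b) + (1 - alpha) * cosine(flat_a, flat_b)


# Hypothetical 2-dimensional word vectors for two short documents.
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b = [[1.0, 0.2], [0.1, 0.9], [0.8, 1.0]]
print(gaussian_similarity(doc_a, doc_b))
```

In practice the word vectors would come from a pretrained embedding model, and a library routine would replace the hand-written covariance loop; the pure-Python version is used here only to keep the sketch dependency-free.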