Jérémy Ferrero

2017

Amélioration de la similarité sémantique vectorielle par méthodes non-supervisées (Improved the Semantic Similarity with Weighting Vectors)
El-Moatez-Billah Nagoudi | Jérémy Ferrero | Didier Schwab
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Using Word Embedding for Cross-Language Plagiarism Detection
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper proposes using distributed representations of words (word embeddings) for cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representations of words; (b) we combine the proposed methods to verify their complementarity and obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

Deep Investigation of Cross-Language Plagiarism Detection Methods
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper is a deep investigation of cross-language plagiarism detection methods on a recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs at 2 granularities of text units in order to draw robust conclusions about the best methods, while analyzing in depth the correlations across document styles and languages.

CompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present our submitted systems for Semantic Textual Similarity (STS) Track 4 at SemEval-2017. Given a pair of Spanish-English sentences, each system must estimate their semantic similarity with a score between 0 and 5. In our submission, we use syntax-based, dictionary-based, context-based, and MT-based methods. We also combine these methods in both unsupervised and supervised ways. Our best run ranked 1st on track 4a with a correlation of 83.02% with human annotations.

LIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting
El Moatez Billah Nagoudi | Jérémy Ferrero | Didier Schwab
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This article describes our proposed system, named LIM-LIG. The system is designed for SemEval-2017 Task 1: Semantic Textual Similarity (Track 1). LIM-LIG proposes an innovative enhancement to a word embedding-based model for measuring the semantic similarity of Arabic sentences. The main idea is to exploit the representation of words as vectors in a multidimensional space to capture their semantic and syntactic properties. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to help identify the words that are highly descriptive in each sentence. The LIM-LIG system achieves a Pearson's correlation of 0.74633, ranking 2nd among all participants in the Arabic monolingual pairs STS task organized within the SemEval-2017 evaluation campaign.
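
The idea of weighting word vectors by IDF before comparing two sentences can be illustrated with a minimal sketch. This is not the LIM-LIG implementation: the IDF formula (log of N over document frequency), the weighted-sum aggregation, the cosine comparison, and all function names are assumptions chosen for illustration, and the Part-of-Speech weighting mentioned in the abstract is left out.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Assumed IDF definition: idf(w) = log(N / df(w)) over tokenized sentences."""
    n = len(corpus)
    df = Counter(w for sentence in corpus for w in set(sentence))
    return {w: math.log(n / count) for w, count in df.items()}


def sentence_vector(tokens, embeddings, idf):
    """Sum the word vectors of a sentence, each scaled by its IDF weight.

    Words missing from the embedding table are skipped; words missing
    from the IDF table get weight 0.0 (an assumption of this sketch).
    """
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for w in tokens:
        if w in embeddings:
            weight = idf.get(w, 0.0)
            for i, x in enumerate(embeddings[w]):
                vec[i] += weight * x
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

With a toy corpus, a word that occurs in every sentence receives an IDF of zero and therefore contributes nothing to the sentence vector, which is the effect the weighting is meant to have on uninformative words.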

2016

A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection
Jérémy Ferrero | Frédéric Agnès | Laurent Besacier | Didier Schwab
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we describe our effort to create a dataset for the evaluation of cross-language textual similarity detection. We present preexisting corpora and their limits and we explain the various gathered resources to overcome these limits and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment for different granularities (from chunk to document), is based on both parallel and comparable corpora and contains human and machine translated texts. Moreover, it includes texts written by multiple types of authors (from average to professionals). With the obtained dataset, we conduct a systematic and rigorous evaluation of several state-of-the-art cross-language textual similarity detection methods. The evaluation results are reviewed and discussed. Finally, dataset and scripts are made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.

2015

fr2sql : Interrogation de bases de données en français
Benoît Couderc | Jérémy Ferrero
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

Databases are increasingly common and play a growing role in current applications and websites. They are often used by people who have little expertise in the field and do not know their structure in detail. This is why translators from natural language to SQL queries are developed. Unfortunately, most of these translators are restricted to a single database because of the specificity of its architecture. In this article, we propose a method for querying any database from questions in French. We evaluate our application on two databases with different structures, and we also show that it supports more operations than most other translators.