A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection

Jérémy Ferrero; Frédéric Agnès; Laurent Besacier; Didier Schwab

Abstract

In this paper, we describe our effort to create a dataset for evaluating cross-language textual similarity detection. We review preexisting corpora and their limitations, and explain the resources we gathered to overcome these limitations and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignments at several granularities (from chunk to document), is based on both parallel and comparable corpora, and contains human- and machine-translated texts. Moreover, it includes texts written by multiple types of authors (from average writers to professionals). Using this dataset, we conduct a systematic and rigorous evaluation of several state-of-the-art cross-language textual similarity detection methods, and we review and discuss the results. Finally, the dataset and scripts are made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.

Anthology ID: L16-1657
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4162–4169
URL: https://aclanthology.org/L16-1657
Cite (ACL): Jérémy Ferrero, Frédéric Agnès, Laurent Besacier, and Didier Schwab. 2016. A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4162–4169, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection (Ferrero et al., LREC 2016)
PDF: https://preview.aclanthology.org/ml4al-ingestion/L16-1657.pdf
Code: FerreroJeremy/Cross-Language-Dataset
