Vassilis Papavassiliou


2018

The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task
Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of the Institute for Language and Speech Processing/Athena Research and Innovation Center (ILSP/ARC) to the WMT 2018 Parallel Corpus Filtering shared task. We explore several properties of sentences and sentence pairs that our system examined in the context of the task, with the purpose of clustering sentence pairs according to their appropriateness for training MT systems. We also discuss alternative methods for ranking the sentence pairs of the most appropriate clusters, with the aim of generating the two datasets (of 10 and 100 million words, as required by the task) that were evaluated. Averaged over the experiments carried out by the organizers during the evaluation phase, our submission achieved a BLEU score of 26.41, even though it does not use any language-specific resources such as bilingual lexica, monolingual corpora, or MT output; the average score of the best participating system was 27.91.

Discovering Parallel Language Resources for Training MT Engines
Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task
Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories
Prokopis Prokopidis | Vassilis Papavassiliou | Stelios Piperidis
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new collection of multilingual corpora automatically created from the content available in the Global Voices websites, where volunteers have been posting and translating citizen media stories since 2004. We describe how we crawled and processed this content to generate parallel resources comprising 302.6K document pairs and 8.36M segment alignments in 756 language pairs. For some language pairs, the segment alignments in this resource are the first open examples of their kind. In an initial use of this resource, we discuss how a set of document pair detection algorithms performs on the Greek-English corpus.

2015

Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling
Raphael Rubino | Tommi Pirinen | Miquel Esplà-Gomis | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis | Antonio Toral
Proceedings of the Tenth Workshop on Statistical Machine Translation

Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A. Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel L. Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
Miquel Esplà-Gomis | Filip Klubička | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools to crawl 21 multilingual websites from the tourism domain and build a domain-specific English-Croatian parallel corpus. Different settings were tried for both tools, and 10,662 unique document pairs were obtained. A sample of about 10% of them was manually examined, and the success rate was computed on the collection of document pairs detected by each setting. We compare the performance of the settings and the number of distinct document pairs detected by each setting. In addition, we describe the resource obtained, both by the settings and through the human evaluation, which has been released as a high-quality parallel corpus.

2013

A modular open-source focused crawler for mining monolingual and bilingual corpora from the web
Vassilis Papavassiliou | Prokopis Prokopidis | Gregor Thurmair
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study
Pavel Pecina | Antonio Toral | Vassilis Papavassiliou | Prokopis Prokopidis | Josef van Genabith
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2011

Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation
Pavel Pecina | Antonio Toral | Andy Way | Vassilis Papavassiliou | Prokopis Prokopidis | Maria Giagkou
Proceedings of the 15th Annual Conference of the European Association for Machine Translation