Prokopis Prokopidis

2022

Constructing Parallel Corpora from COVID-19 News using MediSys Metadata
Dimitrios Roussis | Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a collection of parallel corpora generated by exploiting the COVID-19 related dataset of metadata created with the Europe Media Monitor (EMM) / Medical Information System (MediSys) processing chain of news articles. We describe how we constructed comparable monolingual corpora of news articles related to the current pandemic and used them to mine about 11.2 million segment alignments in 26 EN-X language pairs, covering most official EU languages plus Albanian, Arabic, Icelandic, Macedonian, and Norwegian. Subsets of this collection have been used in shared tasks (e.g. Multilingual Semantic Search, Machine Translation) aimed at accelerating the creation of resources and tools needed to facilitate access to information in the COVID-19 emergency situation.

SciPar: A Collection of Parallel Corpora from Scientific Abstracts
Dimitrios Roussis | Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis | Vassilis Katsouros
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents SciPar, a new collection of parallel corpora created from openly available metadata of bachelor theses, master theses and doctoral dissertations hosted in institutional repositories, digital libraries of universities and national archives. We describe first how we harvested and processed metadata from 86, mainly European, repositories to extract bilingual titles and abstracts, and then how we mined high quality sentence pairs in a wide range of scientific areas and sub-disciplines. In total, the resource includes 9.17 million segment alignments in 31 language pairs and is publicly available via the ELRC-SHARE repository. The bilingual corpora in this collection could prove valuable in various applications, such as cross-lingual plagiarism detection or adapting Machine Translation systems for the translation of scientific texts and academic writing in general, especially for language pairs which include English.

2021

Asia Minor Greek in Contact (AMGiC): Towards a dialectal Treebank comprising contact-induced grammatical changes.
Konstantinos Sampanis | Prokopis Prokopidis
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

2018

Discovering Parallel Language Resources for Training MT Engines
Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task
Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of the Institute for Language and Speech Processing/Athena Research and Innovation Center (ILSP/ARC) for the WMT 2018 Parallel Corpus Filtering shared task. We explore several properties of sentences and sentence pairs that our system used in the context of the task to cluster sentence pairs according to their appropriateness for training MT systems. We also discuss alternative methods for ranking the sentence pairs of the most appropriate clusters, with the aim of generating the two datasets (of 10 and 100 million words, as required in the task) that were evaluated. Across the experiments carried out by the organizers during the evaluation phase, our submission achieved an average BLEU score of 26.41, even though it does not use any language-specific resources such as bilingual lexica, monolingual corpora, or MT output; the best participating system achieved an average score of 27.91.

2017

Universal Dependencies for Greek
Prokopis Prokopidis | Haris Papageorgiou
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

2016

Abu-MaTran: automatic building of machine translation
Antonio Toral | Sergio Ortiz Rojas | Mikel Forcada | Nikola Ljubešić | Prokopis Prokopidis
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories
Prokopis Prokopidis | Vassilis Papavassiliou | Stelios Piperidis
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new collection of multilingual corpora automatically created from the content available in the Global Voices websites, where volunteers have been posting and translating citizen media stories since 2004. We describe how we crawled and processed this content to generate parallel resources comprising 302.6K document pairs and 8.36M segment alignments in 756 language pairs. For some language pairs, the segment alignments in this resource are the first open examples of their kind. In an initial use of this resource, we discuss how a set of document pair detection algorithms performs on the Greek-English corpus.

The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task
Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2014

Experiments for Dependency Parsing of Greek
Prokopis Prokopidis | Haris Papageorgiou
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
Miquel Esplà-Gomis | Filip Klubička | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English-Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manually examined and the success rate was computed on the collection of pairs of documents detected by each setting. We compare the performance of the settings and the amount of different corpora detected by each setting. In addition, we describe the resource obtained, both by the settings and through the human evaluation, which has been released as a high-quality parallel corpus.

Text condensation aims at shortening the length of an utterance without losing essential textual information. In this paper, we report on the implementation and preliminary evaluation of a sentence condensation tool for Greek using a manually constructed table of 450 lexical paraphrases and a set of rules that delete syntactic subtrees carrying minor semantic information. Evaluation on two sentence sets shows promising results regarding the grammaticality and semantic acceptability of the compressed versions.

2006

Adding multi-layer semantics to the Greek Dependency Treebank
Harris Papageorgiou | Elina Desipri | Maria Koutsombogera | Kanella Pouli | Prokopis Prokopidis
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we give an overview of the approach adopted to add a layer of semantic information to the Greek Dependency Treebank [GDT]. Our ultimate goal is to come up with a large corpus, reliably annotated with rich semantic structures. To this end, a corpus has been compiled encompassing various data sources and domains. This collection has been preprocessed, annotated and validated on the basis of dependency representation. Taking into account multi-layered annotation schemes designed to provide deeper representations of structure and meaning, we describe the methodology followed for the semantic layer, report on the annotation process and the problems faced, and conclude with comments on future work and exploitation of the resulting resource.