Dimitrios Roussis
2022
The ARC-NKUA Submission for the English-Ukrainian General Machine Translation Shared Task at WMT22
Dimitrios Roussis | Vassilis Papavassiliou
Proceedings of the Seventh Conference on Machine Translation (WMT)
The ARC-NKUA (“Athena” Research Center - National and Kapodistrian University of Athens) submission to the WMT22 General Machine Translation shared task concerns the unconstrained tracks of the English-Ukrainian and Ukrainian-English translation directions. The two Neural Machine Translation systems are based on Transformer models, and our primary submissions were determined through experimentation with (a) ensemble decoding, (b) selected fine-tuning with a subset of the training data, (c) data augmentation with back-translated monolingual data, and (d) post-processing of the translation outputs. Furthermore, we discuss filtering techniques and the acquisition of additional data used for training the systems.
Constructing Parallel Corpora from COVID-19 News using MediSys Metadata
Dimitrios Roussis | Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents a collection of parallel corpora generated by exploiting the COVID-19 related dataset of metadata created with the Europe Media Monitor (EMM) / Medical Information System (MediSys) processing chain of news articles. We describe how we constructed comparable monolingual corpora of news articles related to the current pandemic and used them to mine about 11.2 million segment alignments in 26 EN-X language pairs, covering most official EU languages plus Albanian, Arabic, Icelandic, Macedonian, and Norwegian. Subsets of this collection have been used in shared tasks (e.g. Multilingual Semantic Search, Machine Translation) aimed at accelerating the creation of resources and tools needed to facilitate access to information in the COVID-19 emergency situation.
SciPar: A Collection of Parallel Corpora from Scientific Abstracts
Dimitrios Roussis | Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis | Vassilis Katsouros
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents SciPar, a new collection of parallel corpora created from openly available metadata of bachelor theses, master theses and doctoral dissertations hosted in institutional repositories, digital libraries of universities and national archives. We first describe how we harvested and processed metadata from 86, mainly European, repositories to extract bilingual titles and abstracts, and then how we mined high-quality sentence pairs in a wide range of scientific areas and sub-disciplines. In total, the resource includes 9.17 million segment alignments in 31 language pairs and is publicly available via the ELRC-SHARE repository. The bilingual corpora in this collection could prove valuable in various applications, such as cross-lingual plagiarism detection or adapting Machine Translation systems for the translation of scientific texts and academic writing in general, especially for language pairs that include English.