Sokratis Sofianopoulos

Also published as: Sokratis Sofianopoulos

2025

We introduce Llama-Krikri-8B, a cutting-edge Large Language Model tailored for the Greek language, built on Meta’s Llama 3.1-8B. Llama-Krikri-8B has been extensively trained on high-quality Greek data to ensure superior adaptation to linguistic nuances. With 8 billion parameters, it offers advanced capabilities while maintaining efficient computational performance. Llama-Krikri-8B supports both Modern Greek and English, and is also equipped to handle polytonic text and Ancient Greek. The chat version of Llama-Krikri-8B features a multi-stage post-training pipeline, utilizing both human and synthetic instruction and preference data, by applying techniques such as MAGPIE. In addition, for evaluation, we propose three novel public benchmarks for Greek. Our evaluation on existing as well as the proposed benchmarks shows notable improvements over comparable Greek and multilingual LLMs in both natural language understanding and generation as well as code generation.

2024

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
Dimitris Roussis | Sokratis Sofianopoulos | Stelios Piperidis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora from the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the research domains of: Energy Research, Neuroscience, Cancer and Transportation. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

2022

Constructing Parallel Corpora from COVID-19 News using MediSys Metadata
Dimitrios Roussis | Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a collection of parallel corpora generated by exploiting the COVID-19 related dataset of metadata created with the Europe Media Monitor (EMM) / Medical Information System (MediSys) processing chain of news articles. We describe how we constructed comparable monolingual corpora of news articles related to the current pandemic and used them to mine about 11.2 million segment alignments in 26 EN-X language pairs, covering most official EU languages plus Albanian, Arabic, Icelandic, Macedonian, and Norwegian. Subsets of this collection have been used in shared tasks (e.g. Multilingual Semantic Search, Machine Translation) aimed at accelerating the creation of resources and tools needed to facilitate access to information in the COVID-19 emergency situation.

Welocalize-ARC/NKUA’s Submission to the WMT 2022 Quality Estimation Shared Task
Eirini Zafeiridou | Sokratis Sofianopoulos
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents our submission to the WMT 2022 quality estimation shared task and more specifically to the quality prediction sentence-level direct assessment (DA) subtask. We build a multilingual system based on the predictor–estimator architecture by using the XLM-RoBERTa transformer for feature extraction and a regression head on top of the final model to estimate the z-standardized DA labels. Furthermore, we use pretrained models to extract useful knowledge that reflects various criteria of quality assessment and demonstrates good correlation with human judgements. We optimize the performance of our model by incorporating this information as additional external features in the input data and by applying Monte Carlo dropout during both training and inference.

2018

The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task
Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of the Institute for Language and Speech Processing/Athena Research and Innovation Center (ILSP/ARC) for the WMT 2018 Parallel Corpus Filtering shared task. We describe several properties of sentences and sentence pairs that our system explored in the context of the task with the purpose of clustering sentence pairs according to their appropriateness for training MT systems. We also discuss alternative methods for ranking the sentence pairs of the most appropriate clusters with the aim of generating the two datasets (of 10 and 100 million words, as required in the task) that were evaluated. Averaged over the experiments carried out by the organizers during the evaluation phase, our submission achieved a BLEU score of 26.41, even though it does not make use of any language-specific resources like bilingual lexica, monolingual corpora, or MT output; the average score of the best participant system was 27.91.

2012

Evaluating the Translation Accuracy of a Novel Language-Independent MT Methodology
George Tambouratzis | Sokratis Sofianopoulos | Marina Vassiliou
Proceedings of COLING 2012

PRESEMT: Pattern Recognition-based Statistically Enhanced MT
George Tambouratzis | Marina Vassiliou | Sokratis Sofianopoulos
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Implementing a Language-Independent MT Methodology
Sokratis Sofianopoulos | Marina Vassiliou | George Tambouratzis
Proceedings of the First Workshop on Multilingual Modeling

2011

A resource-light phrase scheme for language-portable MT
George Tambouratzis | Fotini Simistira | Sokratis Sofianopoulos | Nikos Tsimboukakis | Marina Vassiliou
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2008

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are, as expected, lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.

2007

Demonstration of the Greek to English METIS-II system
Sokratis Sofianopoulos | Vassiliki Spilioti | Marina Vassiliou | Olga Yannoutsou | Stella Markantonatou
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

Using Patterns for Machine Translation
Stella Markantonatou | Sokratis Sofianopoulos | Vassiliki Spilioti | George Tambouratzis | Marina Vassiliou | Olga Yannoutsou
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

In the present article, a hybrid approach is proposed for implementing a machine translation system using a large monolingual corpus coupled with a bilingual lexicon and basic NLP tools. In the first phase of the METIS system, a source language (SL) sentence, after being tagged, lemmatised and translated by a flat lemma-to-lemma lexicon, was matched against a tagged and lemmatised target language (TL) corpus using a pattern matching algorithm. In the second phase, translations are generated by combining sub-sentential structures. In this paper, the main features of the second phase are discussed while the system architecture and the corresponding translation approach are presented. The proposed methodology is illustrated with examples of the translation process.