Stamatis Outsios
2024
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Iakovos Evdaimon | Hadi Abdine | Christos Xypolopoulos | Stamatis Outsios | Michalis Vazirgiannis | Giorgos Stamou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The era of transfer learning has revolutionized the fields of Computer Vision and Natural Language Processing, bringing powerful pretrained models with exceptional performance across a variety of tasks. Natural Language Processing tasks in particular have been dominated by transformer-based language models. In Natural Language Inference and Natural Language Generation tasks, the BERT model and its variants, as well as the GPT model and its successors, have demonstrated exemplary performance. However, the majority of these models are pretrained and assessed primarily for the English language or on a multilingual corpus. In this paper, we introduce GreekBART, the first Seq2Seq model based on the BART-base architecture and pretrained on a large-scale Greek corpus. We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks. In addition, we examine its performance on two NLG tasks from GreekSUM, a newly introduced summarization dataset for the Greek language. The model, the code, and the new summarization dataset will be publicly available.
2020
An Ensemble Method for Producing Word Representations focusing on the Greek Language
Michalis Lioudakis | Stamatis Outsios | Michalis Vazirgiannis
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
In this paper, we present a new ensemble method, Continuous Bag-of-Skip-grams (CBOS), that produces high-quality word representations with an emphasis on the Greek language. The CBOS method combines the pioneering approaches for learning word representations: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram. These methods are compared through intrinsic and extrinsic evaluation tasks on three different sources of data: the English Wikipedia corpus, the Greek Wikipedia corpus, and the Greek Web Content corpus. By comparing these methods across different tasks and datasets, it is evident that the CBOS method achieves state-of-the-art performance.
Evaluation of Greek Word Embeddings
Stamatis Outsios | Christos Karatsalos | Konstantinos Skianis | Michalis Vazirgiannis
Proceedings of the Twelfth Language Resources and Evaluation Conference
Since word embeddings have been the most popular input for many NLP tasks, evaluating their quality is critical. Most research efforts focus on English word embeddings. This paper addresses the problem of training and evaluating such models for the Greek language. We present a new word analogy test set that draws on the original English Word2vec analogy test set and also accounts for specific linguistic aspects of the Greek language. Moreover, we create a Greek version of the WordSim353 test collection for a basic evaluation of word similarities. The resources produced are available for download. We test seven word vector models, and our evaluation shows that we are able to create meaningful representations. Finally, we find that the morphological complexity of the Greek language and polysemy can influence the quality of the resulting word embeddings.