Stella Markantonatou

Also published as: S. Markantonatou


UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
Agata Savary | Daniel Zeman | Verginica Barbu Mititelu | Anabela Barreiro | Olesea Caftanatov | Marie-Catherine de Marneffe | Kaja Dobrovoljc | Gülşen Eryiğit | Voula Giouli | Bruno Guillaume | Stella Markantonatou | Nurit Melnik | Joakim Nivre | Atul Kr. Ojha | Carlos Ramisch | Abigail Walsh | Beata Wójtowicz | Alina Wróblewska
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation addressed to new members and countries.

Dictionary-Aided Translation for Handling Multi-Word Expressions in Low-Resource Languages
Antonios Dimakis | Stella Markantonatou | Antonios Anastasopoulos
Findings of the Association for Computational Linguistics ACL 2024

Multi-word expressions (MWEs) present unique challenges in natural language processing (NLP), particularly within the context of translation systems, due to their inherent scarcity, non-compositional nature, and other distinct lexical and morphosyntactic characteristics, issues that are exacerbated in low-resource settings. In this study, we elucidate and attempt to address these challenges by leveraging a substantial corpus of human-annotated Greek MWEs. To address the complexity of translating such phrases, we propose a novel method leveraging an available out-of-context lexicon. We assess the translation capabilities of current state-of-the-art systems on this task, employing both automated metrics and human evaluators. We find that by using our method when applicable, the performance of current systems can be significantly improved; however, these models are still unable to produce translations comparable to those of a human speaker.

Multiword Expressions between the Corpus and the Lexicon: Universality, Idiosyncrasy, and the Lexicon-Corpus Interface
Verginica Barbu Mititelu | Voula Giouli | Kilian Evang | Daniel Zeman | Petya Osenova | Carole Tiberius | Simon Krek | Stella Markantonatou | Ivelina Stoyanova | Ranka Stanković | Christian Chiarcos
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.

The Corpus AIKIA: Using Ranking Annotation for Offensive Language Detection in Modern Greek
Stella Markantonatou | Vivian Stamou | Christina Christodoulou | Georgia Apostolopoulou | Antonis Balas | George Ioannakis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce a new corpus, named AIKIA, for Offensive Language Detection (OLD) in Modern Greek (EL). EL is a less-resourced language regarding OLD. AIKIA offers free access to annotated data leveraged from EL Twitter and fiction texts using the lexicon of offensive terms, ERIS, that originates from HurtLex. AIKIA has been annotated for offensive values with the Best Worst Scaling (BWS) method, which is designed to avoid problems of categorical and scalar annotation methods. BWS assigns continuous offensive scores in the form of floating point numbers instead of binary arithmetical or categorical values. AIKIA’s performance in OLD was tested by fine-tuning a variety of pre-trained language models in a binary classification task. Experimentation with a number of thresholds showed that the best mapping of the continuous values to binary labels should occur in the range [0.5-0.6] of BWS values and that the models pre-trained on EL data achieved the highest Macro-F1 scores. Greek-Media-BERT outperformed all models with a threshold of 0.6 by obtaining a Macro-F1 score of 0.92.
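As an illustration of the idea in this abstract, the following is a minimal sketch of how continuous BWS scores can be derived from best/worst judgments and then mapped to binary labels with a threshold. It assumes the common counting procedure (best count minus worst count, divided by the number of appearances, rescaled to [0, 1]) and treats "best" as "most offensive"; the abstract does not state that AIKIA was scored exactly this way. The function names and the toy item identifiers are hypothetical.

```python
from collections import Counter


def bws_scores(annotations):
    """Compute counting-based BWS scores rescaled to [0, 1].

    annotations: iterable of (items, best, worst), where `items` is the
    tuple shown to the annotator, `best` the item judged most offensive
    and `worst` the item judged least offensive (assumed convention).
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    # Raw score lies in [-1, 1]; shift and scale it to [0, 1].
    return {
        item: ((best[item] - worst[item]) / seen[item] + 1) / 2
        for item in seen
    }


def to_binary(scores, threshold=0.6):
    """Map continuous scores to binary labels (1 = offensive)."""
    return {item: int(score >= threshold) for item, score in scores.items()}


# Hypothetical toy data: two annotated 4-tuples of text identifiers.
annotations = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t5"), "t1", "t3"),
]
scores = bws_scores(annotations)
labels = to_binary(scores, threshold=0.6)
```

Sweeping `threshold` over a few values and comparing classifier Macro-F1 on the resulting labels is one way to run the kind of threshold experiment the abstract reports.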


ASR pipeline for low-resourced languages: A case study on Pomak
Chara Tsoukala | Kosmas Kritsis | Ioannis Douros | Athanasios Katsamanis | Nikolaos Kokkas | Vasileios Arampatzakis | Vasileios Sevetlidis | Stella Markantonatou | George Pavlidis
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resource languages can be a challenging task not only due to lack of resources but also due to the work required for the preparation of a training dataset. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.

Methodological issues regarding the semi-automatic UD treebank creation of under-resourced languages: the case of Pomak
Stella Markantonatou | Nicolaos Th. Constantinides | Vivian Stamou | Vasileios Arampatzakis | Panagiotis G. Krimpas | George Pavlidis
Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)

Pomak is an endangered oral Slavic language of Thrace/Greece. We present a short description of its interesting morphological and syntactic features in the UD framework. Because the morphological annotation of the treebank takes advantage of existing resources, it requires a different methodological approach from the one adopted for syntactic annotation that has started from scratch. It also requires the option of obtaining morphological predictions/evaluation separately from the syntactic ones with state-of-the-art NLP tools. Active annotation is applied in various settings in order to identify the best model that would facilitate the ongoing syntactic annotation.


Cleansing & expanding the HURTLEX(el) with a multidimensional categorization of offensive words
Vivian Stamou | Iakovi Alexiou | Antigone Klimi | Eleftheria Molou | Alexandra Saivanidou | Stella Markantonatou
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

We present a cleansed version of the multilingual lexicon HURTLEX-(EL) comprising 737 offensive words of Modern Greek. We worked bottom-up in two annotation rounds and developed detailed guidelines by cross-classifying words on three dimensions: context, reference, and thematic domain. Our classification reveals a wider spectrum of thematic domains concerning the study of offensive language than previously thought (Efthymiou et al., 2014) and reveals social and cultural aspects that are not included in the HURTLEX categories.

Morphologically annotated corpora of Pomak
Ritván Jusúf Karahóǧa | Panagiotis G. Krimpas | Vivian Stamou | Vasileios Arampatzakis | Dimitrios Karamatskos | Vasileios Sevetlidis | Nikolaos Constantinides | Nikolaos Kokkas | George Pavlidis | Stella Markantonatou
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

The project XXXX is developing a platform to enable researchers of living languages to easily create and make available state-of-the-art spoken and textual annotated resources. As a case study we use Greek and Pomak, the latter being an endangered oral Slavic language of the Balkans (including Thrace/Greece). The linguistic documentation of Pomak is an ongoing work by an interdisciplinary team in close cooperation with the Pomak community of Greece. We describe our experience in the development of a Latin-based orthography and morphologically annotated text corpora of Pomak with state-of-the-art NLP technology. These resources will be made openly available on the XXXX site and the gold annotated corpora of Pomak will be made available on the Universal Dependencies treebank repository.

UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren | Omer Goldman | Salam Khalifa | Nizar Habash | Witold Kieraś | Gábor Bella | Brian Leonard | Garrett Nicolai | Kyle Gorman | Yustinus Ghanggo Ate | Maria Ryskina | Sabrina Mielke | Elena Budianskaya | Charbel El-Khaissi | Tiago Pimentel | Michael Gasser | William Abbott Lane | Mohit Raj | Matt Coler | Jaime Rafael Montoya Samame | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Arturo Oncevay | Juan López Bautista | Gema Celeste Silva Villegas | Lucas Torroba Hennigen | Adam Ek | David Guriel | Peter Dirix | Jean-Philippe Bernardy | Andrey Scherbakov | Aziyana Bayyr-ool | Antonios Anastasopoulos | Roberto Zariquiey | Karina Sheifer | Sofya Ganieva | Hilaria Cruz | Ritván Karahóǧa | Stella Markantonatou | George Pavlidis | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Candy Angulo | Jatayu Baxi | Andrew Krizhanovsky | Natalia Krizhanovskaya | Elizabeth Salesky | Clara Vania | Sardana Ivanova | Jennifer White | Rowan Hall Maudslay | Josef Valvoda | Ran Zmigrod | Paula Czarnowska | Irene Nikkarinen | Aelita Salchak | Brijesh Bhatt | Christopher Straughn | Zoey Liu | Jonathan North Washington | Yuval Pinter | Duygu Ataman | Marcin Wolinski | Totok Suhardijanto | Anna Yablonskaya | Niklas Stoehr | Hossep Dolatian | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Aryaman Arora | Richard J. Hatcher | Ritesh Kumar | Jeremiah Young | Daria Rodionova | Anastasia Yemelina | Taras Andrushko | Igor Marchenko | Polina Mashkovtseva | Alexandra Serova | Emily Prud’hommeaux | Maria Nepomniashchaya | Fausto Giunchiglia | Eleanor Chodroff | Mans Hulden | Miikka Silfverberg | Arya D. McCarthy | David Yarowsky | Ryan Cotterell | Reut Tsarfaty | Ekaterina Vylomova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.


Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
Stella Markantonatou | John McCrae | Jelena Mitrović | Carole Tiberius | Carlos Ramisch | Ashwini Vaidya | Petya Osenova | Agata Savary
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

VMWE discovery: a comparative analysis between Literature and Twitter Corpora
Vivian Stamou | Artemis Xylogianni | Marilena Malli | Penny Takorou | Stella Markantonatou
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

We manually evaluate five lexical association measurements as regards the discovery of Modern Greek verb multiword expressions with two or more lexicalised components using mwetoolkit3 (Ramisch et al., 2010). We use Twitter corpora and compare our findings with previous work on fiction corpora. The results of LL, MLE and T-score were found to overlap significantly in both the fiction and the Twitter corpora, while the results of PMI and Dice do not. We find that MWEs with two lexicalised components are more frequent in Twitter than in fiction corpora and that lean syntactic patterns help retrieve them more efficiently than richer ones. Our work (i) supports the enrichment of the lexicographical database for Modern Greek MWEs ‘IDION’ (Markantonatou et al., 2019) and (ii) highlights aspects of the usage of five association measurements on specific text genres for best MWE discovery results.
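To make the association measurements named in this abstract more concrete, here is a minimal sketch of three of them (PMI, Dice and T-score) for a two-word candidate, using their textbook definitions over raw corpus counts. It is an illustration only and is not mwetoolkit3's implementation; the function name and the counts in the example are hypothetical.

```python
import math


def association_scores(c_xy, c_x, c_y, n):
    """Textbook association scores for a two-word candidate.

    c_xy: co-occurrence count of the pair
    c_x, c_y: counts of each word on its own
    n: total number of bigrams in the corpus
    """
    # Count expected if the two words were independent.
    expected = c_x * c_y / n
    return {
        "pmi": math.log2(c_xy / expected),
        "dice": 2 * c_xy / (c_x + c_y),
        "t_score": (c_xy - expected) / math.sqrt(c_xy),
    }


# Hypothetical counts for one candidate pair.
scores = association_scores(c_xy=8, c_x=16, c_y=32, n=1024)
```

Ranking the same candidate list by each score and comparing the top-ranked items is one simple way to see how far the measures overlap.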


IDION: A database for Modern Greek multiword expressions
Stella Markantonatou | Panagiotis Minos | George Zakis | Vassiliki Moutzouri | Maria Chantou
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

We report on the ongoing development of IDION, a web resource of richly documented multiword expressions (MWEs) of Modern Greek addressed to the human user and to NLP. IDION contains about 2000 verb MWEs (VMWEs) of which about 850 are fully documented as regards their syntactic flexibility, their semantics and the semantic relations with other VMWEs. Sets of synonymous MWEs are defined in a bottom-up manner revealing the conceptual organization of the MG VMWE domain.


Fixed Similes: Measuring aspects of the relation between MWE idiomatic semantics and syntactic flexibility
Stella Markantonatou | Panagiotis Kouris | Yanis Maistros
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

We shed light on aspects of the relation between the semantics and the syntactic flexibility of multiword expressions by investigating fixed adjective similes (FS), a predicative multiword expression class not studied in this respect before. We find that only a subset of the syntactic structures observed in the data are related with idiomaticity. We identify and measure two aspects of idiomaticity, one of which seems to allow for predictions about FS syntactic flexibility. Our research draws on a resource developed with the semantic and detailed syntactic annotation of web-retrieved Modern Greek material, indicating frequency of use of the individual similes.


Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Stella Markantonatou | Carlos Ramisch | Agata Savary | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)


Proceedings of the 12th Workshop on Multiword Expressions
Valia Kordoni | Kostadin Cholakov | Markus Egg | Stella Markantonatou | Preslav Nakov
Proceedings of the 12th Workshop on Multiword Expressions


Parsing Modern Greek verb MWEs with LFG/XLE grammars
Niki Samaridi | Stella Markantonatou
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

Encoding MWEs in a conceptual lexicon
Aggeliki Fotopoulou | Stella Markantonatou | Voula Giouli
Proceedings of the 10th Workshop on Multiword Expressions (MWE)


In Search of the ‘Right’ Word
Stella Markantonatou | Aggeliki Fotopoulou | Maria Alexopoulou | Marianna Mini
Proceedings of the 2nd Workshop on Cognitive Aspects of the Lexicon


Evaluation of a Machine Translation System for Low Resource Languages: METIS-II
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman | Stella Markantonatou | Sokratis Sofianopoulos | Marina Vassiliou | Olga Yannoutsou | Toni Badia | Maite Melero | Gemma Boleda | Michael Carl | Paul Schmidt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.


Demonstration of the Greek to English METIS-II system
Sokratis Sofianopoulos | Vassiliki Spilioti | Marina Vassiliou | Olga Yannoutsou | Stella Markantonatou
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers


METIS-II: Machine Translation for Low Resource Languages
Vincent Vandeghinste | Ineke Schuurman | Michael Carl | Stella Markantonatou | Toni Badia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.


Monolingual Corpus-based MT Using Chunks
Stella Markantonatou | Sokratis Sofianopoulos | Vassiliki Spilioti | Yiorgos Tambouratzis | Marina Vassiliou | Olga Yannoutsou | Nikos Ioannou
Workshop on example-based machine translation

In the present article, a hybrid approach is proposed for implementing a machine translation system using a large monolingual corpus coupled with a bilingual lexicon and basic NLP tools. In the first phase of the METIS system, a source language (SL) sentence, after being tagged, lemmatised and translated by a flat lemma-to-lemma lexicon, was matched against a tagged and lemmatised target language (TL) corpus using a pattern matching algorithm. In the second phase, translations are generated by combining sub-sentential structures. In this paper, the main features of the second phase are discussed while the system architecture and the corresponding translation approach are presented. The proposed methodology is illustrated with examples of the translation process.


Using monolingual corpora for statistical machine translation: the METIS system
Yannis Dologlou | Stella Markantonatou | George Tambouratzis | Olga Yannoutsou | Athanassia Fourla | Nikos Iannou
EAMT Workshop: Improving MT through other language technology tools: resources and tools for building MT

Evaluating specifications for controlled Greek
Marina Vassiliou | Stella Markantonatou | Yanis Maistros | Vangelis Karkaletsis
EAMT Workshop: Improving MT through other language technology tools: resources and tools for building MT


Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
M. Gavrilidou | G. Carayannis | S. Markantonatou | S. Piperidis | G. Stainhauer
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Automatic Style Categorisation of Corpora in the Greek Language
George Tambouratzis | Stella Markantonatou | Nikolaos Hairetakis | George Carayannis
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Discriminating the registers and styles in the Modern Greek language
George Tambouratzis | Stella Markantonatou | Nikolaos Hairetakis | Marina Vassiliou | Dimitrios Tambouratzis | George Carayannis
The Workshop on Comparing Corpora


Lexical Rules: What are they?
Andrew Bredenkamp | Stella Markantonatou | Louisa Sadler
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics


Experiments in Reusability of Grammatical Resources
Doug Arnold | Toni Badia | Josef van Genabith | Stella Markantonatou | Stefan Momma | Louisa Sadler | Paul Schmidt
Sixth Conference of the European Chapter of the Association for Computational Linguistics
