Stella Markantonatou

Also published as: S. Markantonatou


2025

UD_Greek-GUD (GUD) is the most recent Universal Dependencies (UD) treebank for Standard Modern Greek (SMG) and the first SMG UD treebank to annotate Verbal Multiword Expressions (VMWEs). GUD contains material from fiction texts and various sites that use colloquial SMG. We describe the special annotation decisions we implemented with GUD, the pipeline we developed to facilitate the active annotation of new material, and we report on the method we designed to evaluate the performance of models trained on GUD as regards VMWE identification tasks.
Lexica of MWEs have always been a valuable resource for various NLP tasks. This paper presents the results of a comprehensive survey on multiword lexical resources that extends a previous one from 2016 to the present. We analyze a diverse set of lexica across multiple languages, reporting on aspects such as creation date, intended usage, languages covered and linguality type, content, acquisition method, accessibility, and linkage to other language resources. Our findings highlight trends in MWE lexicon development focusing on the representation level of languages. This survey aims to support future efforts in creating MWE lexica for NLP applications by identifying these gaps and opportunities.
We report on the development of the first treebank and parser for Eastern Cretan in the framework of Universal Dependencies (UD). Eastern Cretan is a living but under-resourced dialect of Modern Greek. We have worked on the transcription of oral material and relied on active annotation and knowledge transfer from GUD, a treebank of Standard Modern Greek. Along with its other phonological and morphosyntactic differences from Standard Modern Greek, Eastern Cretan (and other varieties of Modern Greek) makes heavy use of euphonics and voicing that have not been included in the UD annotation guidelines so far. We have provided annotation guidelines for Eastern Cretan euphonics and voicing and included them in the models. Knowledge transfer from the treebank of Standard Modern Greek to the dialectal models helped to initiate annotation via an active annotation procedure.
This paper presents the first treebank for the dialect of Lesbos, a low-resource living Northern variety of Modern Greek (MG), annotated according to the Universal Dependencies (UD) framework. So far, the only dialectal treebank available for Greek developed with cross-dialectal knowledge transfer is an East Cretan one, which belongs to the same Southern branch as Standard Modern Greek (SMG). Our study investigates the effectiveness of cross-dialectal knowledge transfer between dialectologically less similar varieties of the same language by leveraging knowledge from SMG to annotate the Northern dialect of Lesbos. We describe the annotation process, present the resulting treebank, inject additional linguistic knowledge to enhance the results, and evaluate the effectiveness of cross-dialectal knowledge transfer for active annotation. Our findings contribute to a better understanding of how dialectal variation within language families affects knowledge transfer in the UD framework, with implications for other low-resource varieties.

2024

Multi-word expressions (MWEs) present unique challenges in natural language processing (NLP), particularly within the context of translation systems, due to their inherent scarcity, non-compositional nature, and other distinct lexical and morphosyntactic characteristics, issues that are exacerbated in low-resource settings. In this study, we elucidate and attempt to address these challenges by leveraging a substantial corpus of human-annotated Greek MWEs. To address the complexity of translating such phrases, we propose a novel method leveraging an available out-of-context lexicon. We assess the translation capabilities of current state-of-the-art systems on this task, employing both automated metrics and human evaluators. We find that by using our method when applicable, the performance of current systems can be significantly improved; however, these models are still unable to produce translations comparable to those of a human speaker.
We introduce a new corpus, named AIKIA, for Offensive Language Detection (OLD) in Modern Greek (EL). EL is a less-resourced language regarding OLD. AIKIA offers free access to annotated data leveraged from EL Twitter and fiction texts using the lexicon of offensive terms, ERIS, that originates from HurtLex. AIKIA has been annotated for offensive values with the Best Worst Scaling (BWS) method, which is designed to avoid problems of categorical and scalar annotation methods. BWS assigns continuous offensive scores in the form of floating point numbers instead of binary arithmetical or categorical values. AIKIA’s performance in OLD was tested by fine-tuning a variety of pre-trained language models in a binary classification task. Experimentation with a number of thresholds showed that the best mapping of the continuous values to binary labels should occur at the range [0.5-0.6] of BWS values and that the models pre-trained on EL data achieved the highest Macro-F1 scores. Greek-Media-BERT outperformed all models with a threshold of 0.6 by obtaining a Macro-F1 score of 0.92.
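As a rough illustration of the scoring and thresholding described in this abstract, the following minimal sketch derives BWS scores by simple counting and maps them to binary labels. It is not the AIKIA pipeline: the function names, the tuple layout of the annotations, the rescaling of scores to [0, 1], and the choice of treating scores at or above the threshold as offensive are all assumptions made for the example.

```python
from collections import Counter


def bws_scores(annotations):
    """Counting-based Best Worst Scaling scores.

    `annotations` is an iterable of (items, best, worst) tuples, where
    `items` is the tuple of items shown to the annotator (hypothetical layout).
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in annotations:
        seen.update(items)          # how often each item was shown
        best[chosen_best] += 1      # how often it was picked as most offensive
        worst[chosen_worst] += 1    # how often it was picked as least offensive
    # Raw score (best - worst) / seen lies in [-1, 1]; rescale to [0, 1] (assumption).
    return {it: ((best[it] - worst[it]) / seen[it] + 1) / 2 for it in seen}


def binarize(scores, threshold=0.6):
    """Map continuous scores to binary labels (1 = offensive, by assumption)."""
    return {it: int(score >= threshold) for it, score in scores.items()}


# Toy data: two annotation tuples over the same four items.
toy = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
]
scores = bws_scores(toy)
labels = binarize(scores, threshold=0.6)
```

Varying `threshold` between 0.5 and 0.6 on real scores is the kind of experiment the abstract refers to; the toy data here only shows the mechanics.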
We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.
This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation addressed to new members and countries.

2023

Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resource languages can be a challenging task not only due to lack of resources but also due to the work required for the preparation of a training dataset. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.
Pomak is an endangered oral Slavic language of Thrace/Greece. We present a short description of its interesting morphological and syntactic features in the UD framework. Because the morphological annotation of the treebank takes advantage of existing resources, it requires a different methodological approach from the one adopted for syntactic annotation that has started from scratch. It also requires the option of obtaining morphological predictions/evaluation separately from the syntactic ones with state-of-the-art NLP tools. Active annotation is applied in various settings in order to identify the best model that would facilitate the ongoing syntactic annotation.

2022

The project XXXX is developing a platform to enable researchers of living languages to easily create and make available state-of-the-art spoken and textual annotated resources. As a case study we use Greek and Pomak, the latter being an endangered oral Slavic language of the Balkans (including Thrace/Greece). The linguistic documentation of Pomak is an ongoing work by an interdisciplinary team in close cooperation with the Pomak community of Greece. We describe our experience in the development of a Latin-based orthography and morphologically annotated text corpora of Pomak with state-of-the-art NLP technology. These resources will be made openly available on the XXXX site and the gold annotated corpora of Pomak will be made available on the Universal Dependencies treebank repository.
UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren | Omer Goldman | Salam Khalifa | Nizar Habash | Witold Kieraś | Gábor Bella | Brian Leonard | Garrett Nicolai | Kyle Gorman | Yustinus Ghanggo Ate | Maria Ryskina | Sabrina Mielke | Elena Budianskaya | Charbel El-Khaissi | Tiago Pimentel | Michael Gasser | William Abbott Lane | Mohit Raj | Matt Coler | Jaime Rafael Montoya Samame | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Arturo Oncevay | Juan López Bautista | Gema Celeste Silva Villegas | Lucas Torroba Hennigen | Adam Ek | David Guriel | Peter Dirix | Jean-Philippe Bernardy | Andrey Scherbakov | Aziyana Bayyr-ool | Antonios Anastasopoulos | Roberto Zariquiey | Karina Sheifer | Sofya Ganieva | Hilaria Cruz | Ritván Karahóǧa | Stella Markantonatou | George Pavlidis | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Candy Angulo | Jatayu Baxi | Andrew Krizhanovsky | Natalia Krizhanovskaya | Elizabeth Salesky | Clara Vania | Sardana Ivanova | Jennifer White | Rowan Hall Maudslay | Josef Valvoda | Ran Zmigrod | Paula Czarnowska | Irene Nikkarinen | Aelita Salchak | Brijesh Bhatt | Christopher Straughn | Zoey Liu | Jonathan North Washington | Yuval Pinter | Duygu Ataman | Marcin Wolinski | Totok Suhardijanto | Anna Yablonskaya | Niklas Stoehr | Hossep Dolatian | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Aryaman Arora | Richard J. Hatcher | Ritesh Kumar | Jeremiah Young | Daria Rodionova | Anastasia Yemelina | Taras Andrushko | Igor Marchenko | Polina Mashkovtseva | Alexandra Serova | Emily Prud’hommeaux | Maria Nepomniashchaya | Fausto Giunchiglia | Eleanor Chodroff | Mans Hulden | Miikka Silfverberg | Arya D. McCarthy | David Yarowsky | Ryan Cotterell | Reut Tsarfaty | Ekaterina Vylomova
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
We present a cleansed version of the multilingual lexicon HURTLEX-(EL) comprising 737 offensive words of Modern Greek. We worked bottom-up in two annotation rounds and developed detailed guidelines by cross-classifying words on three dimensions: context, reference, and thematic domain. Our classification reveals a wider spectrum of thematic domains concerning the study of offensive language than previously thought (Efthymiou et al., 2014) and reveals social and cultural aspects that are not included in the HURTLEX categories.

2020

We manually evaluate five lexical association measurements as regards the discovery of Modern Greek verb multiword expressions with two or more lexicalised components using mwetoolkit3 (Ramisch et al., 2010). We use Twitter corpora and compare our findings with previous work on fiction corpora. The results of LL, MLE and T-score were found to overlap significantly in both the fiction and the Twitter corpora, while the results of PMI and Dice do not. We find that MWEs with two lexicalised components are more frequent in Twitter than in fiction corpora and that lean syntactic patterns help retrieve them more efficiently than richer ones. Our work (i) supports the enrichment of the lexicographical database for Modern Greek MWEs ‘IDION’ (Markantonatou et al., 2019) and (ii) highlights aspects of the usage of five association measurements on specific text genres for best MWE discovery results.
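For readers unfamiliar with the association measurements named in this abstract, the sketch below computes three of them (PMI, Dice and T-score) for a two-word candidate from raw corpus counts, using their textbook definitions. It does not call mwetoolkit3 and is not the paper's code; the function names and the toy counts are assumptions made for the example.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a two-word candidate."""
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))


def dice(pair_count, w1_count, w2_count):
    """Dice coefficient: twice the joint count over the sum of the word counts."""
    return 2 * pair_count / (w1_count + w2_count)


def t_score(pair_count, w1_count, w2_count, total):
    """T-score: (observed - expected) / sqrt(observed)."""
    expected = w1_count * w2_count / total
    return (pair_count - expected) / math.sqrt(pair_count)


# Hypothetical counts: the pair occurs 8 times, its words 16 and 32 times,
# in a corpus of 1024 bigram positions.
pmi_value = pmi(8, 16, 32, 1024)
dice_value = dice(8, 16, 32)
t_value = t_score(8, 16, 32, 1024)
```

PMI rewards rare but exclusive pairings, whereas T-score favours frequent ones, which is one textbook reason why rankings produced by different measurements can diverge.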

2019

We report on the ongoing development of IDION, a web resource of richly documented multiword expressions (MWEs) of Modern Greek addressed to the human user and to NLP. IDION contains about 2000 verb MWEs (VMWEs) of which about 850 are fully documented as regards their syntactic flexibility, their semantics and the semantic relations with other VMWEs. Sets of synonymous MWEs are defined in a bottom-up manner revealing the conceptual organization of the MG VMWE domain.

2018

We shed light on aspects of the relation between the semantics and the syntactic flexibility of multiword expressions by investigating fixed adjective similes (FS), a predicative multiword expression class not studied in this respect before. We find that only a subset of the syntactic structures observed in the data are related with idiomaticity. We identify and measure two aspects of idiomaticity, one of which seems to allow for predictions about FS syntactic flexibility. Our research draws on a resource developed with the semantic and detailed syntactic annotation of web-retrieved Modern Greek material, indicating frequency of use of the individual similes.

2017

2016

2014

2010

2008

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.

2007

2006

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.

2005

In the present article, a hybrid approach is proposed for implementing a machine translation system using a large monolingual corpus coupled with a bilingual lexicon and basic NLP tools. In the first phase of the METIS system, a source language (SL) sentence, after being tagged, lemmatised and translated by a flat lemma-to-lemma lexicon, was matched against a tagged and lemmatised target language (TL) corpus using a pattern matching algorithm. In the second phase, translations are generated by combining sub-sentential structures. In this paper, the main features of the second phase are discussed while the system architecture and the corresponding translation approach are presented. The proposed methodology is illustrated with examples of the translation process.

2003

2000

1996

1993
