Irina Lobzhanidze

2026

Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers or implies in understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

pdf bib abs

The Megrelian Language Corpus (MLC): Creation, Annotation, and Initial Steps toward a UD Treebank
Irina Lobzhanidze | Rusudan Gersamia | Tamar Gogia
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper presents the development of the Megrelian Language Corpus (MLC), a new language resource for the documentation and computational analysis of Megrelian, an endangered Kartvelian language. The corpus is based on fieldwork conducted in Samegrelo, Georgia (2022–2024) and currently contains 97,691 tokens and 60,959 types. The data were transcribed using the International Phonetic Alphabet (IPA) and annotated in Fieldworks Language Explorer (FLEx) with segmentation, morphological analysis and bilingual Georgian-English translations. Each text is accessible through a specially designed web interface, providing multiple tiers of annotation and integrated search functions. The paper describes the corpus design, annotation methodology and challenges encountered in representing Megrelian’s complex agglutinative morphology. It also outlines initial steps toward converting existing data into the Universal Dependencies (UD) framework, building on experience from related Kartvelian languages such as Georgian. The MLC corpus represents the first publicly available linguistic resource for Megrelian and provides a foundation for future UD treebank development.

We present edition 2.0 of the PARSEME multilingual corpus annotated for multiword expressions (MWEs), resulting from efforts of the PARSEME community towards universality-driven modeling of idiomaticity. With respect to previous editions, we extend the annotation scope to all syntactic MWE categories: verbal, nominal, adjectival, adverbial and functional. We cover 17 languages, of which 7 are new. The annotation process is based on cross-lingually unified guidelines, phrased as decision diagrams over linguistic tests, and a typology of 18 MWE categories. The corpus contains almost 5 million tokens, over 250,000 sentences and 140,000 MWE annotations. The applicability of the corpus is tested in baseline experiments with a prompt-based MWE identification system. Results show that generic large language models do not encode sufficient knowledge to solve the MWE identification task.

2025

pdf bib abs

Lexica of MWEs have always been a valuable resource for various NLP tasks. This paper presents the results of a comprehensive survey on multiword lexical resources that extends a previous one from 2016 to the present. We analyze a diverse set of lexica across multiple languages, reporting on aspects such as creation date, intended usage, languages covered and linguality type, content, acquisition method, accessibility, and linkage to other language resources. Our findings highlight trends in MWE lexicon development focusing on the representation level of languages. This survey aims to support future efforts in creating MWE lexica for NLP applications by identifying these gaps and opportunities.

2024

pdf bib abs

Building a Universal Dependencies Treebank for Georgian
Irina Lobzhanidze | Erekle Magradze | Svetlana Berikashvili | Anzor Gozalishvili | Tamar Jalaghonia
Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)

This paper presents the design and development of the Georgian Syntactic Treebank within the Universal Dependencies (UD) framework, addressing the unique morphosyntactic challenges ofGeorgian, a Kartvelian language. We describe the methodology for selecting andannotating 3,013 sentences from Wiki, mapping existing tagsets to the UD scheme, and converting data into the CoNLL-U format. The paper also details the training of a UDPipe model using this preliminary treebank.