Olha Kanishcheva
2026
Ukrainian Multiword Expressions Corpus: Creation, Annotation, and Linguistic Analysis
Hanna Sytar | Maria Shvedova | Olha Kanishcheva
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
Hanna Sytar | Maria Shvedova | Olha Kanishcheva
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper presents the development of a corpus of annotated multiword expressions (MWEs) for Ukrainian. The resource covers four major categories of MWEs: verbal, nominal, adjectival/adverbial, and functional. We describe the methodology used for data selection, the annotation scheme, and the procedures employed during annotation. In addition, the paper discusses some specific types of MWE constructions, illustrating their usage with numerous examples and addressing complex and borderline cases. The resulting corpus is an important resource for linguistic studies and NLP tasks involving MWEs, and is publicly accessible https://gitlab.com/parseme/sharedtask-data/-/tree/master/2.0?ref_type=heads.
PARSEME 2.0 Multilingual Corpus of Multiword Expressions
Agata Savary | Manon Scholivet | Carlos Ramisch | Takuya Nakamura | Eric Bilinski | Sara Stymne | Voula Giouli | Stella Markantonatou | Vasile Pais | Maria Mitrofan | Louis Estève | Bruno Guillaume | Verginica Barbu Mititelu | Jaka Čibej | Roberto Díaz Hernández | Victoria Fendel | Polona Gantar | Olha Kanishcheva | Cvetana Krstev | Chaya Liebeskind | Irina Lobzhanidze | Aleksandra M. Marković | Gunta Nešpore-Bērzkalne | Adriana S. Pagano | Mehrnoush Shamsfard | Ranka Stankovic | Vahide Tajalli | Carole Tiberius | Aakanksha Padhye
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Agata Savary | Manon Scholivet | Carlos Ramisch | Takuya Nakamura | Eric Bilinski | Sara Stymne | Voula Giouli | Stella Markantonatou | Vasile Pais | Maria Mitrofan | Louis Estève | Bruno Guillaume | Verginica Barbu Mititelu | Jaka Čibej | Roberto Díaz Hernández | Victoria Fendel | Polona Gantar | Olha Kanishcheva | Cvetana Krstev | Chaya Liebeskind | Irina Lobzhanidze | Aleksandra M. Marković | Gunta Nešpore-Bērzkalne | Adriana S. Pagano | Mehrnoush Shamsfard | Ranka Stankovic | Vahide Tajalli | Carole Tiberius | Aakanksha Padhye
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present edition 2.0 of the PARSEME multilingual corpus annotated for multiword expressions (MWEs), resulting from efforts of the PARSEME community towards universality-driven modeling of idiomaticity. With respect to previous editions, we extend the annotation scope to all syntactic MWE categories: verbal, nominal, adjectival, adverbial and functional. We cover 17 languages, of which 7 are new. The annotation process is based on cross-lingually unified guidelines, phrased as decision diagrams over linguistic tests, and a typology of 18 MWE categories. The corpus contains almost 5 million tokens, over 250,000 sentences and 140,000 MWE annotations. The applicability of the corpus is tested in baseline experiments with a prompt-based MWE identification system. Results show that generic large language models do not encode sufficient knowledge to solve the MWE identification task.
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers or implies in understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
2023
The Parliamentary Code-Switching Corpus: Bilingualism in the Ukrainian Parliament in the 1990s-2020s
Olha Kanishcheva | Tetiana Kovalova | Maria Shvedova | Ruprecht von Waldenfels
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
Olha Kanishcheva | Tetiana Kovalova | Maria Shvedova | Ruprecht von Waldenfels
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
We describe a Ukrainian-Russian code-switching corpus of Ukrainian Parliamentary Session Transcripts. The corpus includes speeches entirely in Ukrainian, Russian, or various types of mixed speech and allows us to see how speakers switch between these languages depending on the communicative situation. The paper describes the process of creating this corpus from the official multilingual transcripts using automatic language detecting and publicly available metadata on the speakers. On this basis, we consider possible reasons for the change in the number of Ukrainian speakers in the parliament and present the most common patterns of bilingual Ukrainian and Russian code-switching in parliamentarians’ speeches.
Search
Fix author
Co-authors
- Voula Giouli 2
- Chaya Liebeskind 2
- Irina Lobzhanidze 2
- Stella Markantonatou 2
- Adriana Silvina Pagano 2
- Mehrnoush Shamsfard 2
- Maria Shvedova 2
- Vahide Tajalli 2
- Manzura Abjalova 1
- Sopuruchi Christian Aboh 1
- Ágnes Abuczki 1
- Maha Tufail Agro 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Dina Almassova 1
- Diego Alves 1
- Doğukan Arslan 1
- Verginica Barbu Mititelu 1
- Eric Bilinski 1
- Aida Cardoso 1
- Maria Chatzigrigoriou 1
- Kaja Dobrovoljc 1
- Roberto Díaz Hernández 1
- Nilay Erdem Ayyıldız 1
- Doruk Eryiğit 1
- Gülşen Eryiğit 1
- Louis Estève 1
- Victoria Fendel 1
- Polona Gantar 1
- Radovan Garabik 1
- Petra Giommarelli 1
- Shahar Golan 1
- Bruno Guillaume 1
- Isabell Stinessen Haugen 1
- Wei He 1
- Carlos Manuel Hidalgo-Ternero 1
- Nina Hosseini-Kivanani 1
- Shaoxiong Ji 1
- Danka Jokić 1
- Anna Kanellopoulou 1
- Muhammad Ahsan Riaz Khan 1
- Tetiana Kovalova 1
- Jauza Akbar Krito 1
- Cvetana Krstev 1
- Alesia Lazarenka 1
- Noémi Ligeti-Nagy 1
- Veronika Lipp 1
- Aleksandra M. Marković 1
- Jelena M. Marković 1
- Amália Mendes 1
- Maria Mitrofan 1
- Johanna Monti 1
- Numaan Naeem 1
- Takuya Nakamura 1
- Gunta Nešpore-Bērzkalne 1
- Sanni Nimb 1
- Nathalie Carmen Hau Norman 1
- Sussi Olsen 1
- Daniil Orel 1
- Petya Osenova 1
- Aakanksha Padhye 1
- Vasile Pais 1
- Bolette Sandford Pedersen 1
- Marija Pendevska 1
- Fred Philippy 1
- Thomas Pickard 1
- Salsabila Zahirah Pranida 1
- Carlos Ramisch 1
- María Del Mar Sánchez Ramos 1
- Rozane Rebechi 1
- Laura Rituma 1
- Ieva Rizgeliene 1
- Antoni Brosa Rodríguez 1
- Zahra Saaberi 1
- Josue Alejandro Sauca 1
- Agata Savary 1
- Manon Scholivet 1
- Regina E. Semou 1
- Masoumeh Seyyedrezaei 1
- Sarvinoz Sharipova 1
- Inguna Skadina 1
- Ranka Stankovic 1
- Sara Stymne 1
- Srdjan Sucur 1
- Hanna Sytar 1
- Carole Tiberius 1
- Dilara Torunoğlu-Selamet 1
- Samia Touileb 1
- Eleni Triantafyllidi 1
- Kingsley O. Ugwuanyi 1
- Baiba Valkovska 1
- Giedre Valunaite Oleskeviciene 1
- Erik Velldal 1
- Aline Villavicencio 1
- Rodrigo Wilkens 1
- Beata Wójtowicz 1
- Zhuohan Xie 1
- Olha Yatsyshyna 1
- Yelda Yeşildal Eraydın 1
- Ruprecht von Waldenfels 1
- Lilja Øvrelid 1
- Jaka Čibej 1