Daniil Orel
2026
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers or implies in understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
Daniil Orel | Dilshod Azizov | Indraneil Paul | Yuxia Wang | Iryna Gurevych | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Daniil Orel | Dilshod Azizov | Indraneil Paul | Yuxia Wang | Iryna Gurevych | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human–machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human–Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench.
2025
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Nurkhan Laiyk | Daniil Orel | Rituraj Joshi | Maiya Goloburda | Yuxia Wang | Preslav Nakov | Fajri Koto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Nurkhan Laiyk | Daniil Orel | Rituraj Joshi | Maiya Goloburda | Yuxia Wang | Preslav Nakov | Fajri Koto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs’ understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation
Emilio Villa-Cueva | Sholpan Bolatzhanova | Diana Turmakhan | Kareem Elzeky | Henok Biadglign Ademtew | Alham Fikri Aji | Vladimir Araujo | Israel Abebe Azime | Jinheon Baek | Frederico Belcavello | Fermin Cristobal | Jan Christian Blaise Cruz | Mary Dabre | Raj Dabre | Toqeer Ehsan | Naome A Etori | Fauzan Farooqui | Jiahui Geng | Guido Ivetta | Thanmay Jayakumar | Soyeong Jeong | Zheng Wei Lim | Aishik Mandal | Sofía Martinelli | Mihail Minkov Mihaylov | Daniil Orel | Aniket Pramanick | Sukannya Purkayastha | Israfel Salazar | Haiyue Song | Tiago Timponi Torrent | Debela Desalegn Yadeta | Injy Hamed | Atnafu Lambebo Tonja | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025
Emilio Villa-Cueva | Sholpan Bolatzhanova | Diana Turmakhan | Kareem Elzeky | Henok Biadglign Ademtew | Alham Fikri Aji | Vladimir Araujo | Israel Abebe Azime | Jinheon Baek | Frederico Belcavello | Fermin Cristobal | Jan Christian Blaise Cruz | Mary Dabre | Raj Dabre | Toqeer Ehsan | Naome A Etori | Fauzan Farooqui | Jiahui Geng | Guido Ivetta | Thanmay Jayakumar | Soyeong Jeong | Zheng Wei Lim | Aishik Mandal | Sofía Martinelli | Mihail Minkov Mihaylov | Daniil Orel | Aniket Pramanick | Sukannya Purkayastha | Israfel Salazar | Haiyue Song | Tiago Timponi Torrent | Debela Desalegn Yadeta | Injy Hamed | Atnafu Lambebo Tonja | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025
Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings
Daniil Orel | Dilshod Azizov | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
Daniil Orel | Dilshod Azizov | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, this has had important consequences for programming skills, ethics, and assessment integrity, thus making the detection of LLM-generated code essential for maintaining accountability and standards. While, there has been some previous research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. Here, we aim to bridge this gap. In particular, we propose a framework capable of distinguishing between human-written and LLM-generated program code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, alongside applying rigorous data quality checks, feature engineering, and comparative analysis of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Our extensive experiments show that our framework effectively distinguishes human-written from LLM-generated program code, setting a new benchmark for the task.
Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts
Maiya Goloburda | Nurkhan Laiyk | Diana Turmakhan | Yuxia Wang | Mukhammed Togmanov | Jonibek Mansurov | Askhat Sametov | Nurdaulet Mukhituly | Minghan Wang | Daniil Orel | Zain Muhammad Mujahid | Fajri Koto | Timothy Baldwin | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
Maiya Goloburda | Nurkhan Laiyk | Diana Turmakhan | Yuxia Wang | Mukhammed Togmanov | Jonibek Mansurov | Askhat Sametov | Nurdaulet Mukhituly | Minghan Wang | Daniil Orel | Zain Muhammad Mujahid | Fajri Koto | Timothy Baldwin | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorǵau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
Droid: A Resource Suite for AI-Generated Code Detection
Daniil Orel | Indraneil Paul | Iryna Gurevych | Preslav Nakov
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Daniil Orel | Indraneil Paul | Iryna Gurevych | Preslav Nakov
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present DroidCollection, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and three real-world coding domains. Alongside fully AI-generated examples, our collection includes human-AI co-authored code, as well as adversarial examples explicitly crafted to evade detection. Subsequently, we develop DroidDetect, a suite of encoder-only detectors trained using a multi-task objective over DroidCollection. Our experiments show that existing detectors’ performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. We further demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small number of adversarial examples. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as way to enhance detector training on possibly noisy distributions.
Search
Fix author
Co-authors
- Preslav Nakov 5
- Yuxia Wang 3
- Dilshod Azizov 2
- Maiya Goloburda 2
- Iryna Gurevych 2
- Fajri Koto 2
- Nurkhan Laiyk 2
- Indraneil Paul 2
- Diana Turmakhan 2
- Manzura Abjalova 1
- Sopuruchi Christian Aboh 1
- Ágnes Abuczki 1
- Henok Biadglign Ademtew 1
- Maha Tufail Agro 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Alham Fikri Aji 1
- Dina Almassova 1
- Diego Alves 1
- Vladimir Araujo 1
- Doğukan Arslan 1
- Israel Abebe Azime 1
- Jinheon Baek 1
- Timothy Baldwin 1
- Frederico Belcavello 1
- Sholpan Bolatzhanova 1
- Aida Cardoso 1
- Maria Chatzigrigoriou 1
- Fermin Cristobal 1
- Jan Christian Blaise Cruz 1
- Mary Dabre 1
- Raj Dabre 1
- Kaja Dobrovoljc 1
- Toqeer Ehsan 1
- Kareem Elzeky 1
- Nilay Erdem Ayyıldız 1
- Doruk Eryiğit 1
- Gülşen Eryiğit 1
- Naome A. Etori 1
- Fauzan Farooqui 1
- Radovan Garabik 1
- Jiahui Geng 1
- Petra Giommarelli 1
- Voula Giouli 1
- Shahar Golan 1
- Injy Hamed 1
- Isabell Stinessen Haugen 1
- Wei He 1
- Carlos Manuel Hidalgo-Ternero 1
- Nina Hosseini-Kivanani 1
- Guido Ivetta 1
- Thanmay Jayakumar 1
- Soyeong Jeong 1
- Shaoxiong Ji 1
- Danka Jokić 1
- Rituraj Joshi 1
- Anna Kanellopoulou 1
- Olha Kanishcheva 1
- Muhammad Ahsan Riaz Khan 1
- Jauza Akbar Krito 1
- Alesia Lazarenka 1
- Chaya Liebeskind 1
- Noémi Ligeti-Nagy 1
- Zheng Wei Lim 1
- Veronika Lipp 1
- Irina Lobzhanidze 1
- Aishik Mandal 1
- Jonibek Mansurov 1
- Stella Markantonatou 1
- Jelena M. Marković 1
- Sofía Martinelli 1
- Amália Mendes 1
- Mihail Minkov Mihaylov 1
- Johanna Monti 1
- Zain Muhammad Mujahid 1
- Nurdaulet Mukhituly 1
- Numaan Naeem 1
- Sanni Nimb 1
- Nathalie Carmen Hau Norman 1
- Sussi Olsen 1
- Petya Osenova 1
- Adriana Silvina Pagano 1
- Bolette Sandford Pedersen 1
- Marija Pendevska 1
- Fred Philippy 1
- Thomas Pickard 1
- Aniket Pramanick 1
- Salsabila Zahirah Pranida 1
- Sukannya Purkayastha 1
- María Del Mar Sánchez Ramos 1
- Rozane Rebechi 1
- Laura Rituma 1
- Ieva Rizgeliene 1
- Antoni Brosa Rodríguez 1
- Zahra Saaberi 1
- Israfel Salazar 1
- Askhat Sametov 1
- Josue Alejandro Sauca 1
- Regina E. Semou 1
- Masoumeh Seyyedrezaei 1
- Mehrnoush Shamsfard 1
- Sarvinoz Sharipova 1
- Inguna Skadina 1
- Thamar Solorio 1
- Haiyue Song 1
- Srdjan Sucur 1
- Vahide Tajalli 1
- Mukhammed Togmanov 1
- Atnafu Lambebo Tonja 1
- Tiago Timponi Torrent 1
- Dilara Torunoğlu-Selamet 1
- Samia Touileb 1
- Eleni Triantafyllidi 1
- Kingsley O. Ugwuanyi 1
- Baiba Valkovska 1
- Giedre Valunaite Oleskeviciene 1
- Erik Velldal 1
- Emilio Villa-Cueva 1
- Aline Villavicencio 1
- Minghan Wang 1
- Rodrigo Wilkens 1
- Beata Wójtowicz 1
- Zhuohan Xie 1
- Debela Desalegn Yadeta 1
- Olha Yatsyshyna 1
- Yelda Yeşildal Eraydın 1
- Lilja Øvrelid 1