Daniil Orel
2026
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
Daniil Orel | Dilshod Azizov | Indraneil Paul | Yuxia Wang | Iryna Gurevych | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human–machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human–Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench.
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
Stereotype Bias in a Bilingual Setting: A Culturally Grounded Evaluation in Kazakhstan
Nurkhan Laiyk | Daniil Orel | Ayana Mussabayeva | Maiya Goloburda | Kamila Kuishibekova | Liya Goloburda | Diana Turmakhan | Preslav Nakov | Yuxia Wang | Fajri Koto
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Stereotype bias in language models has been widely examined in English, but remains largely understudied in bilingual contexts where multiple linguistic and cultural systems interact. This gap is especially important in regions where language use reflects complex historical and sociopolitical influences. In this work, we focus on Kazakhstan, a bilingual society where Kazakh, a low-resource Turkic language, and Russian, a high-resource Slavic language, are both actively used and frequently code-mixed in everyday communication. We introduce Aqbileq, a high-quality, human-verified dataset consisting of 5,634 stereotype-bearing statements in Kazakh, Russian, and code-mixed forms, covering six culturally salient domains. We evaluate both multilingual and Kazakh-specific language models using perplexity-based scoring and pretraining simulations, and find that stereotype bias is most pronounced in code-mixed inputs. Our results highlight the limitations of existing evaluation frameworks and emphasize the need for culturally grounded, linguistically inclusive benchmarks to better assess and mitigate bias in language models.
2025
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation
Emilio Villa-Cueva | Sholpan Bolatzhanova | Diana Turmakhan | Kareem Elzeky | Henok Biadglign Ademtew | Alham Fikri Aji | Vladimir Araujo | Israel Abebe Azime | Jinheon Baek | Frederico Belcavello | Fermin Cristobal | Jan Christian Blaise Cruz | Mary Dabre | Raj Dabre | Toqeer Ehsan | Naome A Etori | Fauzan Farooqui | Jiahui Geng | Guido Ivetta | Thanmay Jayakumar | Soyeong Jeong | Zheng Wei Lim | Aishik Mandal | Sofía Martinelli | Mihail Minkov Mihaylov | Daniil Orel | Aniket Pramanick | Sukannya Purkayastha | Israfel Salazar | Haiyue Song | Tiago Timponi Torrent | Debela Desalegn Yadeta | Injy Hamed | Atnafu Lambebo Tonja | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025
Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings
Daniil Orel | Dilshod Azizov | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, this has had important consequences for programming skills, ethics, and assessment integrity, thus making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some previous research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. Here, we aim to bridge this gap. In particular, we propose a framework capable of distinguishing between human-written and LLM-generated program code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, and we apply rigorous data quality checks, feature engineering, and comparative analysis of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Our extensive experiments show that our framework effectively distinguishes human-written from LLM-generated program code, setting a new benchmark for the task.
Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts
Maiya Goloburda | Nurkhan Laiyk | Diana Turmakhan | Yuxia Wang | Mukhammed Togmanov | Jonibek Mansurov | Askhat Sametov | Nurdaulet Mukhituly | Minghan Wang | Daniil Orel | Zain Muhammad Mujahid | Fajri Koto | Timothy Baldwin | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorǵau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
Droid: A Resource Suite for AI-Generated Code Detection
Daniil Orel | Indraneil Paul | Iryna Gurevych | Preslav Nakov
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present DroidCollection, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and three real-world coding domains. Alongside fully AI-generated examples, our collection includes human-AI co-authored code, as well as adversarial examples explicitly crafted to evade detection. Subsequently, we develop DroidDetect, a suite of encoder-only detectors trained using a multi-task objective over DroidCollection. Our experiments show that existing detectors’ performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. We further demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small number of adversarial examples. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as a way to enhance detector training on possibly noisy distributions.
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Nurkhan Laiyk | Daniil Orel | Rituraj Joshi | Maiya Goloburda | Yuxia Wang | Preslav Nakov | Fajri Koto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs’ understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entry in our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
Co-authors
- Preslav Nakov 7
- Yuxia Wang 5
- Fajri Koto 4
- Maiya Goloburda 3
- Nurkhan Laiyk 3
- Diana Turmakhan 3
- Dilshod Azizov 2
- Iryna Gurevych 2
- Indraneil Paul 2
- Henok Biadglign Ademtew 1
- Alham Fikri Aji 1
- Sophia Ananiadou 1
- Vladimir Araujo 1
- Israel Abebe Azime 1
- Jinheon Baek 1
- Timothy Baldwin 1
- Debopriyo Banerjee 1
- Frederico Belcavello 1
- Sholpan Bolatzhanova 1
- Tanmoy Chakraborty 1
- Fermin Cristobal 1
- Jan Christian Blaise Cruz 1
- Mary Dabre 1
- Raj Dabre 1
- Toqeer Ehsan 1
- Rania Elbadry 1
- Kareem Elzeky 1
- Naome A. Etori 1
- Fauzan Farooqui 1
- Jiahui Geng 1
- Georgi Nenkov Georgiev 1
- Liya Goloburda 1
- Injy Hamed 1
- Jimin Huang 1
- Guido Ivetta 1
- Thanmay Jayakumar 1
- Soyeong Jeong 1
- Rituraj Joshi 1
- Ivan Koychev 1
- Kamila Kuishibekova 1
- Salem Lahlou 1
- Haonan Li 1
- Zheng Wei Lim 1
- Hachem Madmoun 1
- Aishik Mandal 1
- Jonibek Mansurov 1
- Sofía Martinelli 1
- Mihail Minkov Mihaylov 1
- Zain Muhammad Mujahid 1
- Nurdaulet Mukhituly 1
- Ayana Mussabayeva 1
- Xueqing Peng 1
- Aniket Pramanick 1
- Sukannya Purkayastha 1
- Lingfei Qian 1
- Dhruv Sahnan 1
- Israfel Salazar 1
- Askhat Sametov 1
- Aaryamonvikram Singh 1
- Thamar Solorio 1
- Haiyue Song 1
- Veselin Stoyanov 1
- Jinyan Su 1
- Rushil Thareja 1
- Mukhammed Togmanov 1
- Atnafu Lambebo Tonja 1
- Tiago Timponi Torrent 1
- Emilio Villa-Cueva 1
- Minghan Wang 1
- Zhuohan Xie 1
- Rui Xing 1
- Chen Xu 1
- Debela Desalegn Yadeta 1
- Fan Zhang 1