2025
pdf
bib
abs
MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks
Marius Dumitran
|
Mihnea Buca
|
Theodor Moroianu
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
The rapid advancement of Large Language Models (LLMs) has transformed various domains, particularly computer science (CS) education. These models exhibit remarkable capabilities in code-related tasks and problem-solving, raising questions about their potential and limitations in advanced CS contexts. This study presents a novel bilingual (English–Romanian) multimodal (text and image) dataset of multiple-choice questions derived from a high-level computer science competition. A particularity of our dataset is that the problems are conceived such that some of them are easier solved using reasoning on paper, while for others writing code is more efficient. We systematically evaluate State of The Art LLMs on this dataset, analyzing their performance on theoretical programming tasks. Our findings reveal the strengths and limitations of current LLMs, including the influence of language choice (English vs. Romanian), providing insights into their applicability in CS education and competition settings. We also address critical ethical considerations surrounding educational integrity and the fairness of assessments in the context of LLM usage. These discussions aim to inform future educational practices and policies. To support further research, our dataset will be made publicly available in both English and Romanian. Additionally, we release an educational application tailored for Romanian students, enabling them to self-assess using the dataset in an interactive and practice-oriented environment.
pdf
bib
abs
Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests
Stefan Dascalescu
|
Marius Dumitran
|
Mihai Alexandru Vasiluta
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains resource-intensive and challenging for educators. This paper introduces an innovative NLP-driven method leveraging generative AI (large language models) to automate the creation of high-quality test cases for competitive programming assessments. We extensively evaluated our approach on diverse datasets, including 25 years of Romanian Informatics Olympiad (OJI) data for 5th graders, recent competitions hosted on the Kilonova.ro platform, and the International Informatics Olympiad in Teams (IIOT). Our results demonstrate that AI-generated test cases substantially enhanced assessments, notably identifying previously undetected errors in 67% of the OJI 5th grade programming problems. These improvements underscore the complementary educational value of our technique in formative assessment contexts. By openly sharing our prompts, translated datasets, and methodologies, we offer practical NLP-based tools that educators and contest organizers can readily integrate to enhance assessment quality, reduce workload, and deepen insights into learner performance.We have uploaded a demo which showcases the process of using the prompt in order to generate the test cases for one of the problems from the Kilonova.ro platform, which is accessible through the file we uploaded in the supplementary material section.
pdf
bib
abs
GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Marius Dumitran
|
Angela Dumitran
|
Alexandra Mihaela Danila
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Large language models (LLMs) have revolutionised NLP, yet their pedagogical value for low‐resource languages remains unclear. We present GRILE, the first open benchmark of 1 151 multiple‐choice questions harvested from Romanian high‐stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state‐of‐the‐art multilingual and Romanian‐specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically faithful explanations. While Gemini 2·5 Pro reaches 83% accuracy, most open‐weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM 3 orthographic norms. All data, code and a public web demo are released to catalyse future research. Our findings expose open challenges for trustworthy educational NLP in low‐resource settings and establish GRILE as a new test‐bed for controllable explanation generation and evaluation.
pdf
bib
abs
A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos
Alexandru Ganea
|
Antonia-Adelina Popovici
|
Marius Dumitran
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show “Who Wants to Be a Millionaire?” (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available.