2025
Large Language Models for Education: Understanding the Needs of Stakeholders, Current Capabilities and the Path Forward
Sankalan Pal Chowdhury | Nico Daheim | Ekaterina Kochmar | Jakub Macina | Donya Rooein | Mrinmaya Sachan | Shashank Sonkar
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This tutorial aims to bridge the gap between NLP researchers and Artificial Intelligence in Education (AIED) practitioners. It will help NLP researchers understand the requirements and challenges of education so that they can develop LLMs aligned with educational needs, and it will give educators a deeper understanding of the capabilities and limitations of current NLP technologies, fostering effective integration of LLMs in educational contexts.
Educators’ Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting
Sankalan Pal Chowdhury | Terry Jingchen Zhang | Donya Rooein | Dirk Hovy | Tanja Käser | Mrinmaya Sachan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
The rapid development of Large Language Models (LLMs) opens up the possibility of using them as personal tutors. This has led to the development of several intelligent tutoring systems and learning assistants that use LLMs as back-ends with varying degrees of engineering. In this study, we compare human tutors with LLM tutors in terms of engagement, empathy, scaffolding, and conciseness. We ask human tutors to compare the performance of an LLM tutor with that of a human tutor in teaching grade-school math word problems along these four qualities. We find that annotators with teaching experience perceive LLMs as showing higher performance than human tutors on all four metrics. The biggest advantage is in empathy, where 80% of our annotators prefer the LLM tutor more often than they prefer the human tutors. Our study paints a positive picture of LLMs as tutors and indicates that these models can be used to reduce the load on human teachers in the future.
Are Large Language Models for Education Reliable Across Languages?
Vansh Gupta | Sankalan Pal Chowdhury | Vilém Zouhar | Donya Rooein | Mrinmaya Sachan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Large language models (LLMs) are increasingly being adopted in educational settings, and these applications are expanding beyond English, even though current LLMs remain primarily English-centric. In this work, we ask whether their use in non-English educational settings is warranted. We evaluate the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations, in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that performance on these tasks roughly corresponds to how well each language is represented in the training data, with lower-resource languages showing poorer task performance. However, at least some models are able to largely maintain their level of performance across all languages. We therefore recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.
Can I Introduce My Boyfriend to My Grandmother? Evaluating Large Language Models Capabilities on Iranian Social Norm Classification
Hamidreza Saffari | Mohammadamin Shafiei | Donya Rooein | Francesco Pierri | Debora Nozza
Findings of the Association for Computational Linguistics: NAACL 2025
Creating globally inclusive AI systems demands datasets that reflect diverse social norms. Iran, with its unique cultural blend, offers an ideal case study, and Farsi adds linguistic complexity. In this work, we introduce the Iranian Social Norms (ISN) dataset, a novel collection of 1,699 Iranian social norms annotated with environments, demographic features, and scope, alongside English translations. Our evaluation of six Large Language Models (LLMs) on classifying Iranian social norms, using a variety of prompts, uncovered critical insights into the impact of geographic and linguistic context. Results revealed a substantial gap in LLMs' comprehension of Iranian norms. Notably, while adding geographic context to English prompts enhanced performance, this effect was absent in Farsi, pointing to nuanced linguistic challenges. In particular, performance was significantly worse for Iran-specific norms, emphasizing the importance of culturally tailored datasets. As the first Farsi dataset for social norm classification, ISN will facilitate crucial cross-cultural analyses, shedding light on how values differ across contexts and cultures.
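The prompt-variation setup described in the abstract can be illustrated with a short sketch. This is not the paper's code: the label set, function name, and prompt wording below are hypothetical assumptions, shown only to make the "with vs. without geographic context" comparison concrete.

# Illustrative sketch only (not the paper's implementation): building
# prompt variants for social-norm classification with and without
# geographic context. Labels and wording are hypothetical assumptions;
# a parallel Farsi template would be built the same way to compare the
# two languages, as the paper's evaluation does.

NORM_LABELS = ["expected", "neutral", "taboo"]  # hypothetical label set

def build_prompt(norm: str, geo_context: bool = False) -> str:
    """Build an English classification prompt for one social norm."""
    context = " in Iran" if geo_context else ""
    labels = ", ".join(NORM_LABELS)
    return (
        f"Classify the following social norm{context} as one of: {labels}. "
        f"Reply with a single label.\n\nNorm: {norm}"
    )

# The two variants below differ only in geographic context; the paper
# reports that this context helps in English but not in Farsi.
norm = "Introducing a boyfriend to one's grandmother"
print(build_prompt(norm, geo_context=False))
print(build_prompt(norm, geo_context=True))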
Measuring Gender Bias in Language Models in Farsi
Hamidreza Saffari | Mohammadamin Shafiei | Donya Rooein | Debora Nozza
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
As Natural Language Processing models become increasingly embedded in everyday life, it is critical to ensure that we can measure and mitigate the biases of these systems. While substantial work has been done to identify and mitigate gender bias in English, Farsi remains largely underexplored. This paper presents the first comprehensive study of gender bias in language models in Farsi across three tasks: emotion analysis, question answering, and hurtful sentence completion. We assess a range of language models on all three tasks in zero-shot settings. By adapting established evaluation frameworks to Farsi, we uncover patterns of gender bias that differ from those observed in English, highlighting the urgent need for culturally and linguistically inclusive approaches to bias mitigation in NLP.
2024
Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts
Donya Rooein | Paul Röttger | Anastassia Shaitarova | Dirk Hovy
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
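The contrast between a Static metric and a Prompt-based one can be made concrete with a minimal sketch. This is not the paper's implementation: the helper names and prompt wording below are hypothetical. Only the Flesch Reading Ease formula itself is standard: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).

# Minimal sketch (not the paper's code) contrasting a Static readability
# formula with a hypothetical Prompt-based metric. In practice, the
# LLM's numeric answer to the prompt would be used as a feature in a
# difficulty classifier alongside Static metrics, as the abstract
# describes.
import re

def flesch_reading_ease(text: str) -> float:
    """Static metric: the standard Flesch Reading Ease formula,
    computed from surface features only. Higher = easier."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Naive syllable estimate: count vowel groups per word.
    n_syll = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                 for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)

def prompt_based_metric(text: str, feature: str) -> str:
    """Prompt-based metric (hypothetical wording): ask an LLM to rate an
    abstract feature, e.g. conceptual complexity, that surface formulas
    cannot capture. Returns the prompt to send to the model."""
    return (
        f"On a scale of 1 (very easy) to 5 (very hard), rate the {feature} "
        f"of this text for a school student. Answer with one number.\n\n{text}"
    )

sample = "Photosynthesis converts light energy into chemical energy."
print(flesch_reading_ease(sample))
print(prompt_based_metric(sample, "conceptual complexity"))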