2025
Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction
Suchir Salhan | Hongyi Gu | Donya Rooein | Diana Galvan-Sosa | Gabrielle Gaudeau | Andrew Caines | Zheng Yuan | Paula Buttery
Proceedings of the First BabyLM Workshop
Multi-turn dialogues between a child and caregiver are characterized by a property called contingency: prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a Teacher–Student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, the BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive Teacher decoding strategies show limited additional gains. ContingentChat highlights the benefits of targeted post-training on dialogue quality and presents contingency as a challenging goal for BabyLMs.
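As a rough illustration of the Teacher–Student setup described in the abstract, the sketch below shows how Teacher demonstrations might be collected as post-training targets over short multi-turn exchanges. The interfaces (teacher.generate, student.generate) and the function name are hypothetical placeholders under assumed APIs, not the paper's released code.

```python
# Illustrative sketch only; object interfaces are assumptions, not ContingentChat's code.

def collect_alignment_pairs(teacher, student, opening_turns, n_turns=3):
    """Run short Teacher-Student dialogues and keep Teacher demonstrations
    as post-training targets for the Student (BabyLM)."""
    pairs = []
    for opening in opening_turns:
        history = [opening]
        for _ in range(n_turns):
            student_reply = student.generate(history)   # the BabyLM's own turn
            demonstration = teacher.generate(history)   # contingent Teacher turn for the same context
            pairs.append({"context": list(history), "target": demonstration})
            history.append(student_reply)
    return pairs
```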
Large Language Models for Education: Understanding the Needs of Stakeholders, Current Capabilities and the Path Forward
Sankalan Pal Chowdhury | Nico Daheim | Ekaterina Kochmar | Jakub Macina | Donya Rooein | Mrinmaya Sachan | Shashank Sonkar
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This tutorial aims to bridge the gap between NLP researchers and Artificial Intelligence in Education (AIED) practitioners. It will help participants understand the requirements and challenges of education, enabling them to develop LLMs that align with educational needs, and it will help educators gain a deeper understanding of the capabilities and limitations of current NLP technologies, fostering the effective integration of LLMs in educational contexts.
Educators’ Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting
Sankalan Pal Chowdhury | Terry Jingchen Zhang | Donya Rooein | Dirk Hovy | Tanja Käser | Mrinmaya Sachan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
The rapid development of Large Language Models (LLMs) opens up the possibility of using them as personal tutors. This has led to the development of several intelligent tutoring systems and learning assistants that use LLMs as back-ends with various degrees of engineering. In this study, we seek to compare human tutors with LLM tutors in terms of engagement, empathy, scaffolding, and conciseness. We ask human tutors to compare the performance of an LLM tutor with that of a human tutor in teaching grade-school math word problems on these qualities. We find that annotators with teaching experience perceive LLMs as outperforming human tutors on all four metrics. The biggest advantage is in empathy, where 80% of our annotators prefer the LLM tutor more often than they prefer the human tutors. Our study paints a positive picture of LLMs as tutors and indicates that these models can be used to reduce the load on human teachers in the future.
Are Large Language Models for Education Reliable Across Languages?
Vansh Gupta | Sankalan Pal Chowdhury | Vilém Zouhar | Donya Rooein | Mrinmaya Sachan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Large language models (LLMs) are increasingly being adopted in educational settings. These applications extend beyond English, even though current LLMs remain primarily English-centric. In this work, we ask whether their use in educational settings in non-English languages is warranted. We evaluate the performance of popular LLMs on four educational tasks (identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations) in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, and Czech) in addition to English. We find that performance on these tasks roughly corresponds to how well each language is represented in the training data, with lower-resource languages showing poorer task performance. However, at least some models are able to more or less maintain their performance across all languages. We therefore recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.
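A minimal sketch of the task-by-language evaluation grid described above follows. The task names are taken from the abstract, but model_call, the example format, and the scoring function are assumptions rather than the paper's actual evaluation harness.

```python
# Hypothetical evaluation loop; not the released benchmark code.

LANGUAGES = ["en", "zh", "hi", "ar", "de", "fa", "te", "uk", "cs"]
TASKS = ["misconception_identification", "targeted_feedback",
         "interactive_tutoring", "translation_grading"]

def evaluate(model_call, examples, score):
    """Score one model on every task-language pair and return an average per cell."""
    results = {}
    for task in TASKS:
        for lang in LANGUAGES:
            items = examples[task][lang]
            outputs = [model_call(task, item) for item in items]
            results[(task, lang)] = sum(score(task, o, i) for o, i in zip(outputs, items)) / len(items)
    return results
```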
Biased Tales: Cultural and Topic Bias in Generating Children’s Stories
Donya Rooein | Vilém Zouhar | Debora Nozza | Dirk Hovy
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists' attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children emphasize cultural heritage, tradition, and family themes far more than those featuring Western children. Our findings highlight the importance of addressing sociocultural bias to make creative uses of AI more equitable and diverse.
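The gender comparison reported above (a 55.26% increase in appearance-related attributes) can be illustrated with a small relative-rate computation. The data format and the is_appearance predicate below are assumptions for illustration, not the released Biased Tales schema.

```python
# Hedged illustration of the kind of comparison behind the reported figure.

def appearance_rate(stories, is_appearance):
    """Fraction of protagonist attributes that are appearance-related."""
    attrs = [a for s in stories for a in s["protagonist_attributes"]]
    return sum(is_appearance(a) for a in attrs) / len(attrs)

def relative_increase(girl_stories, boy_stories, is_appearance):
    """Percentage increase in appearance-related attributes for girl protagonists."""
    girl = appearance_rate(girl_stories, is_appearance)
    boy = appearance_rate(boy_stories, is_appearance)
    return 100 * (girl - boy) / boy
```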
Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Chenfei Xiong | Jingwei Ni | Yu Fan | Vilém Zouhar | Donya Rooein | Lorena Calvo-Bartolomé | Alexander Miserlis Hoyle | Zhijing Jin | Mrinmaya Sachan | Markus Leippold | Dirk Hovy | Mennatallah El-Assady | Elliott Ash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists the user in incorporating edge-case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses demonstrate the effectiveness of Co-DETECT.
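The iterative loop described in the abstract can be summarized in a short, schematic sketch. All function names (llm_annotate, induce_edge_case_rules, expert_review) are hypothetical stand-ins for the system's actual components, assumed here for illustration.

```python
# Schematic of one mixed-initiative round; not the Co-DETECT implementation.

def co_detect_round(codebook, dataset, llm_annotate, induce_edge_case_rules, expert_review):
    """One iteration: LLM annotation -> edge-case induction -> expert codebook update."""
    annotations = llm_annotate(codebook, dataset)
    flagged = [ex for ex, ann in zip(dataset, annotations) if ann["uncertain"]]
    candidate_rules = induce_edge_case_rules(codebook, flagged)   # high-level edge-case descriptions
    return expert_review(codebook, candidate_rules)               # expert folds rules into the codebook
```

Repeated rounds of this loop yield the compact, generalizable annotation rules the abstract refers to.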
Can I Introduce My Boyfriend to My Grandmother? Evaluating Large Language Models Capabilities on Iranian Social Norm Classification
Hamidreza Saffari | Mohammadamin Shafiei | Donya Rooein | Francesco Pierri | Debora Nozza
Findings of the Association for Computational Linguistics: NAACL 2025
Creating globally inclusive AI systems demands datasets reflecting diverse social norms. Iran, with its unique cultural blend, offers an ideal case study, with Farsi adding linguistic complexity. In this work, we introduce the Iranian Social Norms (ISN) dataset, a novel collection of 1,699 Iranian social norms, including environment, demographic, and scope annotations, alongside English translations. Our evaluation of six Large Language Models (LLMs) in classifying Iranian social norms, using a variety of prompts, uncovered critical insights into the impact of geographic and linguistic context. Results revealed a substantial performance gap in LLMs' comprehension of Iranian norms. Notably, while geographic context in English prompts enhanced performance, this effect was absent in Farsi, pointing to nuanced linguistic challenges. In particular, performance was significantly worse for Iran-specific norms, emphasizing the importance of culturally tailored datasets. As the first Farsi dataset for social norm classification, ISN will facilitate crucial cross-cultural analyses, shedding light on how values differ across contexts and cultures.
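To make the geographic-context manipulation concrete, here is a hypothetical prompt-construction sketch. The template wording, label set, and the Farsi counterparts are assumptions; the actual ISN evaluation prompts are not reproduced here.

```python
# Illustrative prompt variants only; not the paper's templates or label set.

LABELS = ["expected", "acceptable", "unacceptable"]   # hypothetical label set

def build_english_prompt(norm_text, with_geo_context=True):
    """Build an English classification prompt with or without the geographic cue."""
    prefix = "In Iran, " if with_geo_context else ""
    return (f"{prefix}how would the following behaviour be judged socially? "
            f"Answer with one of {LABELS}.\n\nBehaviour: {norm_text}")

# The Farsi condition uses analogous templates written in Farsi, again with and
# without the geographic context, over the same 1,699 norms.
```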
Measuring Gender Bias in Language Models in Farsi
Hamidreza Saffari | Mohammadamin Shafiei | Donya Rooein | Debora Nozza
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
As Natural Language Processing models become increasingly embedded in everyday life, ensuring that these systems can measure and mitigate bias is critical. While substantial work has been done to identify and mitigate gender bias in English, Farsi remains largely underexplored. This paper presents the first comprehensive study of gender bias in language models in Farsi across three tasks: emotion analysis, question answering, and hurtful sentence completion. We assess a range of language models across all the tasks in zero-shot settings. By adapting established evaluation frameworks for Farsi, we uncover patterns of gender bias that differ from those observed in English, highlighting the urgent need for culturally and linguistically inclusive approaches to bias mitigation in NLP.
2024
Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts
Donya Rooein | Paul Röttger | Anastassia Shaitarova | Dirk Hovy
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
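The regression comparison described above could look roughly like the following scikit-learn sketch, which contrasts Static metrics alone with Static plus Prompt-based metrics as features. The concrete feature extraction, classifier, and labels used in the paper may differ; this is only a minimal sketch under those assumptions.

```python
# Minimal sketch: difficulty classification with and without Prompt-based features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_feature_sets(static_feats, prompt_feats, labels):
    """Cross-validated accuracy with Static metrics alone vs. Static + Prompt-based metrics."""
    static_only = cross_val_score(LogisticRegression(max_iter=1000),
                                  static_feats, labels, cv=5).mean()
    combined = np.hstack([static_feats, prompt_feats])
    with_prompt = cross_val_score(LogisticRegression(max_iter=1000),
                                  combined, labels, cv=5).mean()
    return static_only, with_prompt
```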