Jamie Mikeska

2026

Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation
Beata Beigman Klebanov | Andrew Hoang | Jamie Mikeska | Benny Longwill | Sanjna Kashyap | Shreyashi Halder | Aakanksha Bhatia
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Recent advances in the capabilities of conversational agents based on large language models make them a very promising tool for role playing K-12 students in order to train educators in conversational teaching practices, such as eliciting student thinking, explaining disciplinary content, and facilitating a classroom discussion. In fact, such simulations can and have been developed relatively quickly and without data to machine-learn from – neither classroom data nor human-simulated data. To enhance the usefulness and effectiveness of such teaching simulations, it is necessary to provide pedagogically sound, timely, and personalized feedback to the educator about their simulation performance. In this study, we present experiments on fine-tuning models to evaluate educator performance in an elicitation teaching simulation. The models are developed with data collected during usability testing of the simulation and evaluated on real user data. We show that even with relatively little fine-tuning data, robust performance can be obtained.

2025

Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o
Yuya Asano | Beata Beigman Klebanov | Jamie Mikeska
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Engaging students in a coherent classroom discussion is one aspect of high-quality instruction and is an important skill that requires practice to acquire. With the goal of providing teachers with formative feedback on their classroom discussions, we investigate automated means for evaluating teachers’ ability to lead coherent discussions in simulated classrooms. While prior work has shown the effectiveness of large language models (LLMs) in assessing the coherence of relatively short texts, it has also found that LLMs struggle when assessing instructional quality. We evaluate the generalizability of task formulation strategies for assessing the coherence of classroom discussions across different subject domains using GPT-4o and discuss how these formulations address the previously reported challenges—the overestimation of instructional quality and the inability to extract relevant parts of discussions. Finally, we report lack of generalizability across domains and the misalignment with humans in the use of evidence from discussions as remaining challenges.

2024

CAMAL: A Novel Dataset for Multi-label Conversational Argument Move Analysis
Viet Dac Lai | Duy Ngoc Pham | Jonathan Steinberg | Jamie Mikeska | Thien Huu Nguyen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Understanding the discussion moves that teachers and students use to engage in classroom discussions is important to support pre-service teacher learning and teacher educators. This work introduces a novel conversational multi-label corpus of teaching transcripts collected from a simulated classroom environment for Conversational Argument Move AnaLysis (CAMAL). The dataset offers various argumentation moves used by pre-service teachers and students in mathematics and science classroom discussions. The dataset includes 165 transcripts from these discussions that pre-service elementary teachers facilitated in a simulated classroom environment of five student avatars. The discussion transcripts were annotated by education assessment experts for nine argumentation moves (a.k.a. intents) used by the pre-service teachers and students during the discussions. In this paper, we describe the dataset, our annotation framework, and the models we employed to detect argumentation moves. Our experiments with state-of-the-art models demonstrate the complexity of the CAMAL task presented in the dataset. The results reveal that models that combined CNN and LSTM structures with speaker ID graphs improved the F1-score of our baseline models to detect speakers’ intents by a large margin. Given the complexity of the CAMAL task, it creates research opportunities for future studies. We share the dataset, the source code, and the annotation framework publicly at http://github.com/uonlp/camal-dataset.

Automated Evaluation of Teacher Encouragement of Student-to-Student Interactions in a Simulated Classroom Discussion
Michael Ilagan | Beata Beigman Klebanov | Jamie Mikeska
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Leading students to engage in argumentation-focused discussions is a challenge for elementary school teachers, as doing so requires facilitating group discussions with student-to-student interaction. The Mystery Powder (MP) Task was designed to be used in online simulated classrooms to develop teachers’ skill in facilitating small group science discussions. In order to provide timely and scalable feedback to teachers facilitating a discussion in the simulated classroom, we employ a hybrid modeling approach that successfully combines fine-tuned large language models with features capturing important elements of the discourse dynamic to evaluate MP discussion transcripts. To our knowledge, this is the first application of a hybrid model to automate evaluation of teacher discourse.
