Christopher Ormerod


2025

Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Joshua Wilson | Christopher Ormerod | Magdalen Beiting Parrish
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

Long context Automated Essay Scoring with Language Models
Christopher Ormerod | Gitit Kehat
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

In this study, we use the Kaggle ASAP 2.0 dataset to evaluate several models that incorporate architectural modifications to overcome the length limitations of the standard transformer architecture. The models considered include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama.
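As a rough illustration of the kind of setup the abstract describes, the sketch below fine-tunes a long-context encoder as a regression-style essay scorer with Hugging Face Transformers. The model name, sequence length, and the toy essay and score are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (not the paper's exact setup): a long-context encoder
# fine-tuned as a regression-style essay scorer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "allenai/longformer-base-4096"  # one of several long-context options
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"
)

essay = "..."  # a long student essay; Longformer accepts up to 4096 tokens
inputs = tokenizer(essay, truncation=True, max_length=4096, return_tensors="pt")
labels = torch.tensor([[3.0]])  # human holistic score for this essay (toy value)

outputs = model(**inputs, labels=labels)  # MSE loss for the regression head
outputs.loss.backward()                   # one illustrative training step
print(float(outputs.logits))              # predicted (unrounded) score
```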

Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress
Joshua Wilson | Christopher Ormerod | Magdalen Beiting Parrish
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress

Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers
Joshua Wilson | Christopher Ormerod | Magdalen Beiting Parrish
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers

Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems
Christopher Ormerod
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers

This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). The approach is demonstrated on the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate the annotations: a generative language model used for spelling correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating these annotations into the scoring process, we demonstrate improved performance from encoder-based large language models fine-tuned as classifiers.
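The sketch below illustrates the general idea of feedback-driven annotation: inline markers for argumentative elements are injected into the essay text before it is scored by an encoder-based classifier. The marker format, model choice, span offsets, and score scale are illustrative assumptions rather than the paper's exact pipeline.

```python
# Hedged sketch: inject feedback-style annotations into the essay text as
# inline tags, then score the annotated text with an encoder classifier.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def annotate(essay: str, spans: list[tuple[int, int, str]]) -> str:
    """Wrap character spans (start, end, label) in inline tags, inserting from
    right to left so earlier offsets stay valid."""
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        essay = f"{essay[:start]}<{label}>{essay[start:end]}</{label}>{essay[end:]}"
    return essay

essay = "In my opinion school should start later because students need sleep."
spans = [(0, 13, "claim"), (40, 67, "evidence")]  # e.g. output of a token classifier

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=5  # score points 1-5, an assumption
)
inputs = tokenizer(annotate(essay, spans), truncation=True, return_tensors="pt")
print(model(**inputs).logits.argmax(-1))  # predicted score class
```

Before fine-tuning, the classification head is randomly initialized, so the printed prediction is meaningless; the point of the sketch is only how annotations enter the scorer's input.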

SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction
Alexander Scarlatos | Nigel Fernandez | Christopher Ormerod | Susan Lottridge | Andrew Lan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to obtain difficulty estimates. This approach also cannot be applied in the cold-start setting, where items have not yet been seen by any students. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
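As a sketch of the preference-pair construction described above, the snippet below evaluates two candidate responses under a simple IRT model and prefers the one whose score is more plausible for a simulated student of the stated ability. The 2PL parameterization and all numbers are illustrative assumptions; the paper works with open-ended items and an LLM-based scorer, so this shows only the skeleton of the idea.

```python
# Hedged sketch: form a DPO preference pair by ranking candidate responses by
# how likely their scores are under a ground-truth IRT model for a given ability.
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability of a correct (score = 1) response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def response_likelihood(score: int, theta: float, a: float, b: float) -> float:
    p = p_correct_2pl(theta, a, b)
    return p if score == 1 else 1.0 - p

theta = -0.5          # instructed ability of the simulated student
a, b = 1.2, 0.3       # item discrimination and difficulty (toy values)
cand = {"response A": 1, "response B": 0}  # scores assigned by an LLM-based scorer

# "chosen" is the response whose score is more plausible for this ability.
ranked = sorted(cand, key=lambda r: response_likelihood(cand[r], theta, a, b), reverse=True)
chosen, rejected = ranked[0], ranked[1]
print(chosen, rejected)  # here "response B" wins: score 0 is more likely at theta = -0.5
```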

2024

Can Language Models Guess Your Identity? Analyzing Demographic Biases in AI Essay Scoring
Alexander Kwako | Christopher Ormerod
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Large language models (LLMs) are increasingly used for automated scoring of student essays. However, these models may perpetuate societal biases if not carefully monitored. This study analyzes potential biases in an LLM (XLNet) trained to score persuasive student essays, based on data from the PERSUADE corpus. XLNet achieved strong performance based on quadratic weighted kappa, standardized mean difference, and exact agreement with human scores. Using available metadata, we performed analyses of scoring differences across gender, race/ethnicity, English language learning status, socioeconomic status, and disability status. Automated scores exhibited small magnifications of marginal differences in human scoring, favoring female students over male students and White students over Black students. To probe potential biases further, we examined whether separate XLNet classifiers and XLNet hidden states could predict demographic membership and found that they did so weakly. Overall, the results reinforce the need for continued fairness analyses as the use of LLMs in education expands.
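The subgroup checks named in the abstract (quadratic weighted kappa, standardized mean difference, and exact agreement between automated and human scores) can be sketched as below. The SMD formulation and the toy scores and group labels are assumptions for illustration, not the study's exact definitions or data.

```python
# Hedged sketch: per-group agreement and standardized mean difference between
# automated and human scores.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 2, 4, 3, 1, 4, 2, 3])
auto  = np.array([3, 2, 3, 3, 2, 4, 2, 4])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # e.g. a demographic label

for g in np.unique(group):
    h, m = human[group == g], auto[group == g]
    qwk = cohen_kappa_score(h, m, weights="quadratic")        # quadratic weighted kappa
    exact = float(np.mean(h == m))                            # exact agreement rate
    pooled_sd = np.sqrt((h.var(ddof=1) + m.var(ddof=1)) / 2)  # one common SMD denominator
    smd = (m.mean() - h.mean()) / pooled_sd
    print(f"group {g}: QWK={qwk:.2f}  exact={exact:.2f}  SMD={smd:+.2f}")
```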