@inproceedings{burleigh-etal-2025-pre,
    title = "Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data",
    author = "Burleigh, Tyler  and
      Chen, Jing  and
      Dicerbo, Kristen",
    editor = "Wilson, Joshua  and
      Ormerod, Christopher  and
      Beiting Parrish, Magdalen",
    booktitle = "Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers",
    month = oct,
    year = "2025",
    address = "Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States",
    publisher = "National Council on Measurement in Education (NCME)",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.aimecon-sessions.7/",
    pages = "61--68",
    ISBN = "979-8-218-84230-7",
    abstract = "Story retell assessments provide valuable insights into reading comprehension but face implementation barriers due to time-intensive administration and scoring. This study examines whether Large Language Models (LLMs) can reliably replicate human judgment in grading story retells. Using a novel dataset, we conduct three complementary studies examining LLM performance across different rubric systems, agreement patterns, and reasoning alignment. We find that LLMs (a) achieve near-human reliability with appropriate rubric design, (b) perform well on easy-to-grade cases but poorly on ambiguous ones, (c) produce explanations for their grades that are plausible for straightforward cases but unreliable for complex ones, and (d) different LLMs display consistent ``grading personalities'' (systematically scoring harder or easier across all student responses). These findings support hybrid assessment architectures where AI handles routine scoring, enabling more frequent formative assessment while directing teacher expertise toward students requiring nuanced support."
}