Mark Warschauer


2026

Automated Essay Scoring (AES) is shifting from feature engineering to LLMs, yet current training-free approaches struggle with calibration, often exhibiting a "middle-score bias" that fails to distinguish between exceptional and weak writing. In this work, we introduce MADRAG (Multi-Agent Debate with Retrieval-Augmented Generation), a training-free framework designed to achieve the reliability of supervised models without the need for labeled training data. MADRAG decomposes the scoring process into a multi-agent interaction: an Advocate highlights essay strengths, a Skeptic critiques weaknesses, and a Judge synthesizes these arguments to assign a score. Crucially, we augment the Judge with a RAG mechanism that retrieves rubric-aligned exemplar essays spanning the full score range, grounding the debate in concrete evidence. Evaluating our approach on the ASAP dataset for analytic trait scoring, we demonstrate that MADRAG significantly outperforms existing prompt-based LLM baselines and achieves performance competitive with state-of-the-art supervised models.
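The Advocate, Skeptic, and Judge flow described in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `LLM` callable type, the prompt wording, the `retrieve_exemplars` helper, and the use of word-overlap similarity as a stand-in for whatever retriever the authors use. The only parts drawn from the abstract are the three roles and the idea that the Judge sees exemplars covering the full score range.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def retrieve_exemplars(
    essay: str, pool: List[Tuple[str, int]], per_score: int = 1
) -> List[Tuple[str, int]]:
    """Return the most similar exemplar(s) for every score level in the pool.

    Similarity is plain word overlap (Jaccard), used only as a placeholder
    for a real retriever.
    """
    essay_words = set(essay.lower().split())

    def overlap(text: str) -> float:
        words = set(text.lower().split())
        union = essay_words | words
        return len(essay_words & words) / len(union) if union else 0.0

    # Group exemplars by score so that every score level is represented.
    by_score: Dict[int, List[str]] = {}
    for text, score in pool:
        by_score.setdefault(score, []).append(text)

    chosen: List[Tuple[str, int]] = []
    for score in sorted(by_score):
        ranked = sorted(by_score[score], key=overlap, reverse=True)
        chosen.extend((text, score) for text in ranked[:per_score])
    return chosen


def score_essay(essay: str, rubric: str, pool: List[Tuple[str, int]], llm: LLM) -> str:
    """Run one Advocate / Skeptic / Judge round and return the Judge's reply."""
    strengths = llm(
        f"You are the Advocate. Rubric: {rubric}\nEssay: {essay}\nList the strengths."
    )
    weaknesses = llm(
        f"You are the Skeptic. Rubric: {rubric}\nEssay: {essay}\nList the weaknesses."
    )
    exemplars = retrieve_exemplars(essay, pool)
    shown = "\n".join(f"[score {score}] {text}" for text, score in exemplars)
    return llm(
        "You are the Judge. Assign a score using the rubric and exemplars.\n"
        f"Rubric: {rubric}\nExemplars:\n{shown}\nEssay: {essay}\n"
        f"Advocate: {strengths}\nSkeptic: {weaknesses}"
    )
```

Grouping the pool by score before ranking is what guarantees that the Judge sees at least one exemplar per score level, rather than only the nearest neighbours, which could all share one score.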
Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
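To make the contrast between early-token and full-sequence confidence concrete, here is a minimal sketch of two such proxies computed from a list of per-token log-probabilities. The function names, the use of a simple mean, and the default window `k=5` are assumptions chosen for illustration; the abstract does not specify the exact statistics or window size used.

```python
from statistics import mean
from typing import Sequence


def early_token_confidence(logprobs: Sequence[float], k: int = 5) -> float:
    """Mean log-probability over the first k generated tokens (k is an assumed window)."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    return mean(logprobs[:k])


def full_sequence_confidence(logprobs: Sequence[float]) -> float:
    """Mean log-probability over the whole generated sequence."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return mean(logprobs)
```

The two proxies can diverge sharply for the same response: a generation that opens confidently and then becomes uncertain gets a high early-token value but a low full-sequence value, which is why the choice of window matters when correlating confidence with judge scores.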
Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture (a Constructor and an Auditor) with an LLM-as-judge that scores each agent’s reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
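The AUROC figures quoted above measure how well a confidence score separates responses flagged as critical failures from the rest. A minimal sketch of that computation follows, under the assumption (not stated in the abstract) that lower confidence is treated as evidence of failure. The function names are illustrative, and the pairwise definition is the standard one: the probability that a randomly chosen positive outranks a randomly chosen negative, with ties counted as one half.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: chance that a positive example scores above a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def failure_auroc(confidences: Sequence[float], failure_flags: Sequence[int]) -> float:
    """AUROC for detecting failures, assuming LOWER confidence signals failure."""
    # Negate confidence so that a higher score means "more likely a failure".
    return auroc([-c for c in confidences], failure_flags)
```

A value of 0.5 corresponds to chance-level detection and 1.0 to perfect separation, so the gap between 0.804 and 0.634 reflects a substantially weaker confidence signal for the Auditor. The quadratic pairwise loop is fine for small evaluation sets; a rank-based implementation would be preferable for large ones.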

2022

Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is a scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two ways: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models’ fine-grained learning skills. Second, the dataset supports the question generation (QG) task in the education domain. Through benchmarking with QG models, we show that a QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.