Ali Keramati


2026

Automated Essay Scoring (AES) is shifting from feature engineering to LLMs, yet current training-free approaches struggle with calibration, often exhibiting a "middle-score bias" that fails to distinguish between exceptional and weak writing. In this work, we introduce MADRAG (Multi-Agent Debate with Retrieval-Augmented Generation), a training-free framework designed to achieve the reliability of supervised models without the need for labeled training data. MADRAG decomposes the scoring process into a multi-agent interaction: an Advocate highlights essay strengths, a Skeptic critiques weaknesses, and a Judge synthesizes these arguments to assign a score. Crucially, we augment the Judge with a RAG mechanism that retrieves rubric-aligned exemplar essays spanning the full score range, grounding the debate in concrete evidence. Evaluating our approach on the ASAP dataset for analytic trait scoring, we demonstrate that MADRAG significantly outperforms existing prompt-based LLM baselines and achieves performance competitive with state-of-the-art supervised models.
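The Advocate, Skeptic, and Judge flow described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the agents are plain callables standing in for LLM calls, `select_exemplars` picks the single most similar exemplar per score level as one possible way to span the score range, `jaccard` is a toy similarity function, and the two debaters do not see each other's output.

```python
from typing import Callable, List, Tuple

Exemplar = Tuple[int, str]  # (score, essay text); hypothetical representation


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity; a stand-in for a real retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def select_exemplars(
    essay: str,
    pool: List[Exemplar],
    similarity: Callable[[str, str], float] = jaccard,
) -> List[Exemplar]:
    """Keep the most similar exemplar for each score level present in the pool,
    so the returned set covers every score the pool covers."""
    best = {}
    for score, text in pool:
        sim = similarity(essay, text)
        if score not in best or sim > best[score][0]:
            best[score] = (sim, text)
    return [(score, best[score][1]) for score in sorted(best)]


def score_essay(
    essay: str,
    advocate: Callable[[str], str],
    skeptic: Callable[[str], str],
    judge: Callable[[str, str, str, List[Exemplar]], int],
    pool: List[Exemplar],
) -> int:
    """One debate round: both debaters argue, then the Judge scores the essay
    with the arguments and the retrieved exemplars as context."""
    strengths = advocate(essay)   # assumed: Advocate sees only the essay
    weaknesses = skeptic(essay)   # assumed: Skeptic sees only the essay
    exemplars = select_exemplars(essay, pool)
    return judge(essay, strengths, weaknesses, exemplars)
```

In practice each callable would wrap a prompted LLM call; keeping them as injected functions makes the control flow testable without a model.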
Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
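As a rough illustration of the comparison above, the following sketch computes an early-token confidence proxy and a full-sequence proxy from per-token log-probabilities, then correlates a proxy with judge scores. The window size `k=5`, the use of the mean as the aggregate, and Pearson correlation are assumptions made for the example; the abstract does not specify them.

```python
from math import sqrt
from typing import Sequence


def early_token_confidence(logprobs: Sequence[float], k: int = 5) -> float:
    """Mean log-probability over the first k generated tokens (k is assumed)."""
    if not logprobs:
        raise ValueError("logprobs must be non-empty")
    window = logprobs[:k]
    return sum(window) / len(window)


def full_sequence_confidence(logprobs: Sequence[float]) -> float:
    """Mean log-probability over the whole generated sequence."""
    if not logprobs:
        raise ValueError("logprobs must be non-empty")
    return sum(logprobs) / len(logprobs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between a confidence proxy and judge scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

Computing each proxy once per agent response and passing the resulting lists to `pearson` alongside the judge scores gives one correlation per proxy, which is the kind of comparison the abstract describes.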
Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture (a Constructor and an Auditor) with an LLM-as-judge that scores each agent’s reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
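For readers unfamiliar with the metric, AUROC for confidence-based failure detection can be computed as the probability that a randomly chosen failure case receives a higher failure score than a randomly chosen non-failure case, with ties counted as one half. The sketch below assumes the failure score is simply the negated confidence (lower confidence suggests failure); that choice and the function names are illustrative, not taken from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs in which the
    positive has the higher score; ties contribute 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def failure_auroc(confidences: Sequence[float], failures: Sequence[int]) -> float:
    """AUROC for detecting critical failures (label 1) from confidence,
    assuming lower confidence indicates a likelier failure."""
    return auroc([-c for c in confidences], failures)
```

The pairwise form is quadratic in the number of examples, which is fine for small evaluation sets; a rank-based implementation would be preferable at scale.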