2025
pdf
bib
abs
Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Jiazheng Li
|
Yuxiang Zhou
|
Junru Lu
|
Gladys Tyen
|
Lin Gui
|
Cesare Aloisi
|
Yulan He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Although preference optimization methods have improved reasoning performance in Large Language Models (LLMs), they often lack transparency regarding why one reasoning outcome is preferred over another. This limitation is especially critical in Automated Student Answer Scoring (ASAS), where explainability is essential to justify assessment outcomes. Verbal reinforcement learning offers the potential to generate explicit reflection, but it tends to produce superficial critiques that can harm assessment performance. Existing LLMs also struggle to reliably detect subtle reasoning errors in ASAS tasks. Moreover, manually identifying intermediate reasoning errors is expensive and difficult to scale. To address these challenges, we introduce a **contrastive reflection synthesis pipeline** that generates precise verbal feedback by identifying discrepancies in structure reasoning graph paths. Leveraging these synthetic reflection data, we propose *DARS*, a Dual-model Reflective Scoring framework featuring a dedicated Critic model trained for effective reflection. *DARS* achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments further provide novel insights into the value of reflection data, framework design, and the scaling behavior of *DARS*. We release the DARS code at https://github.com/lijiazheng99/DARS.
pdf
bib
abs
AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
Jiazheng Li
|
Artem Bobrov
|
Runcong Zhao
|
Cesare Aloisi
|
Yulan He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Explainability in automated student answer scoring systems is critical for building trust and enhancing usability among educators. Yet, generating high-quality assessment rationales remains challenging due to the scarcity of annotated data and the prohibitive cost of manual verification, prompting heavy reliance on rationales produced by large language models (LLMs), which are often noisy and unreliable. To address these limitations, we present AERA Chat, an interactive visualization platform designed for automated explainable student answer assessment. AERA Chat leverages multiple LLMs to concurrently score student answers and generate explanatory rationales, offering innovative visualization features that highlight critical answer components and rationale justifications. The platform also incorporates intuitive annotation and evaluation tools, supporting educators in marking tasks and researchers in evaluating rationale quality from different models. We demonstrate the effectiveness of our platform through evaluations of multiple rationale-generation methods on several datasets, showcasing its capability for facilitating robust rationale evaluation and comparative analysis.
pdf
bib
abs
LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop
Runcong Zhao
|
Artem Bobrov
|
Jiazheng Li
|
Cesare Aloisi
|
Yulan He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.
2024
pdf
bib
abs
Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
Jiazheng Li
|
Hainiu Xu
|
Zhaoyue Sun
|
Yuxiang Zhou
|
David West
|
Cesare Aloisi
|
Yulan He
Findings of the Association for Computational Linguistics: EMNLP 2024
Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Plus, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework achieves a 38% assessment performance improvement in the QWK score compared to prior work while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths. Data and code are available at: https://github.com/lijiazheng99/thought_tree_assessment.
2023
pdf
bib
abs
Distilling ChatGPT for Explainable Automated Student Answer Assessment
Jiazheng Li
|
Lin Gui
|
Yuxiang Zhou
|
David West
|
Cesare Aloisi
|
Yulan He
Findings of the Association for Computational Linguistics: EMNLP 2023
Providing explainable and faithful feedback is crucial for automated student answer assessment. In this paper, we introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation. We identify the appropriate instructions by prompting ChatGPT with different templates to collect the rationales, where inconsistent rationales are refined to align with marking standards. The refined ChatGPT outputs enable us to fine-tune a smaller language model that simultaneously assesses student answers and provides rationales. Extensive experiments on the benchmark dataset show that the proposed method improves the overall QWK score by 11% compared to ChatGPT. Furthermore, our thorough analysis and human evaluation demonstrate that the rationales generated by our proposed method are comparable to those of ChatGPT. Our approach provides a viable solution to achieve explainable automated assessment in education