Workshop on Innovative Use of NLP for Building Educational Applications (2026)
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Ekaterina Kochmar | Bashar Alhafni | Stefano Bannò | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anais Tack | Victoria Yaneva | Zheng Yuan
Theory of Mind and Application in Educational Context
Effat Farhana | Maha Zainab | Qiaosi Wang | Niloofar Mireshghallah | Ramira van der Meulen | Max van Duijn
This tutorial examines the integration of Theory of Mind (ToM) into AI-driven tutoring systems, with a focus on how large language models (LLMs) can represent learners’ cognitive and emotional states to enable adaptive, personalized feedback. Participants will learn foundational ToM concepts from cognitive science and psychology and how these ideas can be operationalized in AI systems. We discuss mutual ToM, in which both tutors and learners model each other’s mental states, and address challenges including misconception detection, metacognitive modeling, and privacy in data-driven tutoring. The tutorial also includes hands-on demonstrations of machine ToM in programming education using benchmark datasets such as CS1QA and CodeQA. By combining theoretical foundations, empirical insights, and practical exercises, this tutorial will provide an overview of designing human-centered, ethically aware, and cognitively informed AI tutoring systems.
We introduce a thermal–visual fusion approach to improve non-invasive Voice Activity Detection (VAD) for classroom engagement monitoring. In noisy multi-speaker classrooms using a single microphone, acoustic-only methods fail to reliably isolate individual speakers. Our method integrates facial thermal signatures, which capture respiratory and speech-related heat patterns, with visual lip-motion cues to provide a localized, privacy-preserving, and acoustic-independent indicator of speech activity. This system acts as a visual-diarization frontend, informing Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) systems not only when speech occurs, but precisely which student is speaking. Using up to 19 engineered features, our Thermal-Only Random Forest classifier achieved a Recall of 0.9234 and an F1-score of 0.8105 in subject-independent evaluations, outperforming visual-only baselines. The system was validated as a proof-of-concept on a Raspberry Pi 5 in a controlled laboratory setting, demonstrating real-time feasibility. These results demonstrate that thermal–visual fusion enables more reliable linguistic analysis of collaborative learning and provide critical input for AI agents to facilitate group participation in real-world educational settings that lead to more successful learning outcomes.
Investigating Context-aware CTC for Pronunciation Assessment: Mitigating Peaky Behavior and Context Independency Assumption
Jiun-Ting Li | Tien-Hong Lo | Bi-Cheng Yan | Shih-Hsuan Chiu | Fu-An Chao | Berlin Chen
Automatic pronunciation assessment (APA) provides L2 learners with scalable and timely feedback on pronunciation proficiency in a target language, typically through goodness of pronunciation (GOP) features. GOP quantifies how well a pronounced phoneme matches the expected target sound by comparing acoustic features against the model’s posterior probabilities. Traditional GOP relies on forced alignment to obtain these posteriors, but it suffers from acoustic-induced misalignments that degrade assessment reliability. Although the standard CTC-GOP approach bypasses forced alignment, it is limited by the inherent peaky behavior of CTC-based ASR models, which produces sparse posteriors and lacks stable temporal information. To address these issues in standard CTC, we propose a context-aware CTC framework incorporating output context dependency (OCD) in the CTC topology, along with label prior (LP) and maximum conditional entropy (EnCTC) regularization, to mitigate peakiness and produce more stable ASR logits suitable for GOP computation. Experiments on the speechocean762 corpus demonstrate that our best context-aware configurations achieve superior phoneme-level performance, outperforming the TDNN-F baseline and standard CTC in unified GOPT (phoneme PCC 0.641 vs. 0.612; word total PCC 0.582 vs. 0.549) while narrowing the gap in hierarchical HierCB scoring. These improvements widen the scoring margin between correct and mispronounced phonemes from 0.708 to 0.816 in GOPT. They also reveal that mitigating CTC peakiness and incorporating context dependency significantly enhance CTC-GOP stability and robustness, especially for alignment-free APA models.
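As a rough illustration of the GOP idea described above, the following sketch computes a classic posterior-based score: the average log posterior of the target phoneme over the frames assigned to it. The frame posteriors, phoneme labels, and the function name `gop_score` are hypothetical and chosen only for illustration. It does not implement the context-aware CTC framework proposed in the paper.

```python
import math

def gop_score(frame_posteriors, target):
    """Average log posterior of `target` over the frames assigned to it.

    `frame_posteriors` is a list of dicts mapping phoneme -> probability,
    one dict per frame (hypothetical input format, assumed for this sketch).
    """
    if not frame_posteriors:
        raise ValueError("need at least one frame")
    eps = 1e-10  # floor to avoid log(0) when the target has no mass
    logs = [math.log(max(frame.get(target, 0.0), eps)) for frame in frame_posteriors]
    return sum(logs) / len(logs)

# Toy frames: the first segment sounds like the target "AE", the second does not.
good = [{"AE": 0.9, "EH": 0.1}, {"AE": 0.8, "EH": 0.2}]
bad = [{"AE": 0.2, "EH": 0.8}, {"AE": 0.1, "EH": 0.9}]

good_score = gop_score(good, "AE")
bad_score = gop_score(bad, "AE")
```

A score closer to zero indicates that the model's posteriors agree more strongly with the expected phoneme, so `good_score` is higher than `bad_score` here.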
A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges
Wen Liang | Li Siyan | Zackary Rackauckas | Julia Hirschberg
Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing systems along these dimensions. This survey reviews and categorizes automated presentation coaching systems, spanning pronunciation tutors, fluency and prosody coaches, multimodal trainers, and conference Q&A practice tools. We introduce a five-dimensional task taxonomy - covering segmental pronunciation, lexical stress, suprasegmental prosody, pacing, and content faithfulness - and explicitly map surveyed systems onto it to reveal coverage gaps. We further review the core technical methods these systems employ: TTS-based exemplar generation and diagnostic methods for pronunciation, prosody, and fluency assessment. Key open challenges include the scarcity of annotated presentation corpora, achieving accent-fair feedback across diverse L1 backgrounds, and delivering low-latency diagnostics for real-time rehearsal.
The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
Li Lucy | Albert Zhang | Nathan Anderson | Ryan Knight | Kyle Lo
Effective mathematics education requires identifying and responding to students’ mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students’ handwritten, hand-drawn responses to math problems. We find that models’ weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who may require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
Criterial Features in German: Towards Interpretable NLP in Readability Assessment
Denise Loefflad | Sofia Kathmann | Heiko Holz | Detmar Meurers
This paper presents an empirical evaluation of the German Grammar Profile (GGP), a CEFR-aligned resource of criterial features, and its corresponding extraction system PALME. We design a systematic test suite in which each feature extractor is evaluated on controlled positive and negative examples. The results show that PALME achieves high precision and recall across all CEFR levels, with over 90% of features achieving scores above 0.8. Qualitative analysis shows that lower performance primarily results from morphological ambiguity in noun and adjective case marking. To evaluate the usefulness of the criterial features of the GGP for CEFR-aligned readability assessment, we assess their predictive power using Explainable Boosting Machines on graded readers. The model achieves strong performance (precision: 0.75, recall: 0.73). Our qualitative analysis shows that features related to specific verb constructions follow patterns consistent with developmental stages predicted by Processability Theory. These findings underline the value and relevance of criterial features for modeling language development in readability assessment.
Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization
Jaewook Lee | Alexander Scarlatos | Simon Woodhead | Andrew Lan
With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor–student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners’ needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We train a steering vector using preference optimization: an activation-space direction that guides model responses toward specific tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned scaling coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.
Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM
Younghun Lee | Amir Bralin | Nobel Sanjay Rebello | Dan Goldwasser
Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale college course (N > 1,000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework’s pedagogical utility by analyzing the learning trajectories; we demonstrate how iterative conversations with the LLM facilitate the shift from a misconception to correct understanding.
RABIT: Rationale-Based Distillation Towards Interpretable Automatic Speaking Assessment via a Small Language Model
Bi-Cheng Yan | Hong-Yun Lin | Fu-An Chao | Jiun-Ting Li | Berlin Chen
Automatic speaking assessment (ASA) aims to quantify the language competence of foreign language learners by providing a proficiency score based on their spoken response. Existing efforts in ASA typically employ a neural grader integrated with a set of handcrafted features to assess learners’ oral proficiency from multiple facets. Despite decent performance, the black-box nature of these neural graders remains a significant barrier to providing interpretable explanations for the grading results. In light of this, we propose RABIT for ASA, a novel Rationale-based knowledge distillation framework for interpretable grading decisions via a small language model. Specifically, RABIT first extracts multi-faceted grading rationales from a large language model (LLM) pertaining to the learner’s response and the scoring guidelines. Subsequently, a compact yet efficient language model, equipped with distinct output heads, is jointly optimized to estimate a proficiency score while generating a sequence of grading rationales in an autoregressive manner. A series of experiments conducted on the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method over several cutting-edge baselines.
Towards Pedagogically Aligned LLM Tutors for Math Mistake Remediation
Kseniia Petukhova | Tien Dat Nguyen | Ekaterina Kochmar
Large language models have strong potential for use in intelligent tutoring systems, but they often fail to follow effective pedagogical strategies, such as guiding students without revealing final answers. We study the application of a two-stage alignment pipeline for math mistake remediation, combining supervised fine-tuning on tutoring dialogs with Direct Preference Optimization on synthetic preference pairs. We construct a dataset that integrates existing tutoring corpora with synthetic data generated along pedagogical dimensions, such as scaffolding and factuality, and study different input configurations that incorporate solution correctness and gold answers. Experiments show that this approach improves both factual accuracy and pedagogical quality over base models and existing tutoring models. Human evaluation further indicates that our best model is competitive with a strong proprietary baseline, while providing additional benefits in terms of openness, transparency, and reproducibility. Our results highlight the effectiveness of preference-based pedagogical alignment, while also revealing challenges in reliably evaluating tutoring quality.
Challenges in Machine Translation of Interactive Multimodal Exercises
Lucie Polakova | Miroslav Hrabal | Věra Kloudová | Michal Novák | Mariia Anisimova | Martin Popel
This paper describes linguistic and technological challenges encountered within an applied project aimed at expanding a large e-learning portal from its original Czech to three other languages: Ukrainian, English and German. Although there seems to be a general belief that machine translation is a solved task in 2026, we show that translating educational content, which in our case is highly terminological, multimodal, interactive and encoded in XML, brings along many challenges of different types, some easily solvable and some not. We also compare our results from the early phase of the project (Transformer-based machine translation) with those after the switch to the LLM-based translation methods. We show that both MT methods are prone to different types of errors, some of which are quite new (such as the undesired correction of counterfactual statements) and require new ways of handling them. The resulting four-language edition of the educational web portal will be freely available to educators, students and researchers by the end of 2026.
Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts
Christopher Runyon | Peter Baldwin | Ian Micir | Kevin Frome | Stephanie Mann | Saed Rezayi | Keelan Evanini | Victoria Yaneva
Generative AI is increasingly used to accelerate assessment content development, yet its effectiveness for generating content used in complex assessment tasks for knowledge-rich domains such as medical education is unclear. This study evaluates automated LLM-supported workflows for generating patient-centered communication assessment items that allow students to practice their communication skills. We compared two content generation approaches—constrained linear and exploratory branching—each implemented with and without anchoring in vetted multiple-choice questions (MCQs). Ten subject-matter experts (SMEs) evaluated 80 communication items across six quality dimensions using structured rubrics. The constrained linear approach yielded better ratings than exploratory branching approaches, particularly for medical accuracy and alignment with learning objectives and patient-centered behaviors. MCQ anchoring did not improve medical accuracy. Only a minority of items met all criteria without requiring revision, and no items were unanimously approved by all SMEs. These findings underscore the importance of workflow design in LLM-supported assessment content generation, the continued need for human oversight, and the current limitations of automated content generation in medical education.
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs
Stefano Banno | Kate Knill | Mark Gales
Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system’s true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.
Using Interaction Log Data to Evaluate and Improve Feedback Accuracy in an Intelligent Language Tutoring System
Mariia Soliar | Leona Colling | Stephen Bodnar | Detmar Meurers
Intelligent Tutoring Systems (ITSs) can record learner interactions in fine-grained detail at scale. This opens the door to data-driven methods for investigating system performance and identifying points for improvement. In this paper, we draw on authentic log data from an English language ITS (N_logs = 5646, N_students = 368) to investigate the performance of its feedback algorithm. In step 1 of our analysis, we profiled feedback accuracy by exploring how well the system provided error-specific feedback to malformed student answers in gap-filling grammar exercises using an expert-created set of feedback generation rules. We then identified frequently occurring student errors that triggered incorrect or unspecific feedback and refined the rule set used to detect and respond to these errors with correct specific feedback. In step 2, we validated the rule modifications on an unseen dataset. Comparing the performance of the initial and updated rule sets, we find significant improvement that generalizes to unseen data. Our study thus illustrates how an empirical evaluation of authentic data can complement feedback creators’ expertise by informing rule refinement decisions that yield significant and generalizable performance improvements to feedback in ITSs.
A Bigger Catch: Fine-Grained Curriculum Standards Alignment on the MathFish Benchmark
Xinman Liu | Mayank Sharma | Xinyu Shi
Most existing math benchmarks for LLMs focus on evaluating whether models produce correct solutions. In educational settings, however, it is equally important to understand whether LLMs grasp the pedagogical intent behind math problems, beyond simply arriving at the right answer. Tagging curriculum standards is challenging for the same reason: distinguishing fine-grained standards requires understanding subtle pedagogical distinctions. In this paper, we use the MathFish benchmark, which frames curriculum alignment as a multi-label prediction task over 385 Common Core State Standards, to evaluate a three-stage pipeline inspired by observed failure modes in retrieval and structural reasoning: curriculum-informed hard negatives (M1), a cross-encoder reranker (M2), and a ReAct agent paired with an LLM-as-a-judge critic (M3). We additionally evaluate a training-free alternative (A1) that combines hybrid sparse-dense retrieval with curriculum-graph reranking. M3 achieves 31.3% exact-match accuracy, approximately 6.5× higher than the three-shot GPT-4-Turbo baseline. Error analysis shows that, despite these improvements, the pipeline still struggles with missing predictions, grade-level misalignment, and sibling-standard confusion, reinforcing that precise curriculum alignment remains a fundamentally difficult problem in educational NLP.
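To make the exact-match metric for multi-label prediction concrete, here is a minimal sketch: a problem counts as correct only when the predicted set of standards equals the gold set exactly. The function name `exact_match_accuracy` and the toy labels are assumptions for illustration and are not taken from the MathFish benchmark or the paper's evaluation code.

```python
def exact_match_accuracy(predicted, gold):
    """Fraction of items whose predicted label set equals the gold label set."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return hits / len(gold)

# Toy example with illustrative standard codes.
gold = [{"4.NF.A.1"}, {"3.OA.A.1", "3.OA.A.2"}, {"5.NBT.B.5"}]
pred = [{"4.NF.A.1"}, {"3.OA.A.1"}, {"5.NBT.B.5"}]

accuracy = exact_match_accuracy(pred, gold)
```

In the toy data the second prediction misses one of two gold standards, so it earns no credit under exact match even though it is partly right. This strictness is one reason such scores tend to be low on fine-grained label sets.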
Through the Sentence Lens: Explainable Essay Scoring through Fine-Grained Predictions
Daniel Mora Melanchthon | Stefan Keller | Andrea Horbach
Beyond performance, model transparency is a crucial factor in Automated Essay Scoring, yet current systems often lack explainability, limiting their pedagogical value and users’ trust. Existing explainability methods, such as gradient-based attribution or feature-importance approaches, either produce counterintuitive explanations or are too complex for classroom use. To address this limitation, we make use of fine-grained prediction at the sentence level as a way to enhance explainability. We propose ablation strategies to derive sentence-level pseudo scores from essay-level gold scores and use them to train sentence-level models. We evaluate their performance against essay-level baselines on two datasets (ASAP and MEWS), and compare their sentence-level output to a human baseline. Results indicate a trade-off between essay-level performance and sentence-level granularity. For the language quality trait, most sentence-level models achieve performance comparable to the essay-level baseline, whereas for content, the approach yields more positive results on prompts with shorter
Instruction-Following LLMs for Grammatical Error Correction: Analyzing Neutral-Anchored Instructional Sensitivity Across Editing Modes
Tolgahan Türker | Gülşen Eryiğit
Grammatical Error Correction (GEC) requires models to make edit decisions under competing objectives: correcting errors while either minimizing changes or maximizing fluency. However, we lack a principled characterization of how instruction-following Large Language Models (LLMs) shift their edit decisions across such editing modes, and whether standard evaluation setups faithfully reflect these shifts. We address this gap by defining three modes (Neutral, Minimal-Edit, and Fluency-Edit) and measuring neutral-anchored performance shifts to quantify instructional sensitivity. We benchmark seven LLMs, including proprietary and open-weight models, in a unified zero-shot prompting schema on CoNLL-2014, BEA-2019, and JFLEG datasets. The Minimal-Edit instruction mitigates over-editing and typically boosts precision; in some settings, strong models also improve recall, suggesting more selective and effective corrections. In contrast, the Fluency-Edit instruction often encourages broader paraphrastic rewriting that may improve perceived fluency while lowering GLEU, suggesting both a metric-objective mismatch and a shift away from targeted local correction. Notably, Claude-Sonnet-4.5 demonstrates superior zero-shot capabilities, outperforming previously reported scores and matching or even exceeding few-shot results across CoNLL-2014 (F_0.5: 67.05), BEA-2019 (F_0.5: 64.91), and JFLEG (GLEU: 66.09).
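For readers unfamiliar with the F_0.5 metric reported above, the sketch below shows the standard F-beta formula, which with beta = 0.5 weights precision more heavily than recall. The function name `f_beta` and the example precision and recall values are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # no correct edits at all
    return (1 + beta ** 2) * precision * recall / denom

# Illustrative values only: a cautious corrector with high precision, low recall.
cautious = f_beta(0.8, 0.4)
# The mirror case: same numbers swapped, low precision and high recall.
eager = f_beta(0.4, 0.8)
```

Because precision is weighted more heavily, `cautious` scores higher than `eager` even though both have the same F1. This is consistent with the abstract's observation that instructions reducing over-editing tend to help on this metric.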
Assessing the Quality and Consistency of Automated Knowledge Component Generation using Instructor-generated Questions and LLMs
Jordan Esiason | Priyanka Khare | Wookhee Min | Seung Lee | Gamze Ozogul | Xiaoying Zheng | Yeil Jeong
Lecture-style instruction is one of the most prevalent forms of learning in postsecondary education in the United States. Despite the factors that make lectures a convenient format, they tend to present few opportunities for meaningful engagement between students and the course materials being presented due to factors such as the overhead associated with interacting with large numbers of students. By utilizing large language models, we have created a pipeline built upon the ExplainIt classroom response system for processing student self-explanations produced during lectures using automatically generated knowledge components. This pipeline can facilitate deeper engagement with course materials, offer traceability in assessment results, and allow instructors to respond to student errors or misconceptions in real time during lectures. While previous work using a proprietary large language model has examined the basic functionality of this pipeline, this work more closely examines the consistency and quality of this pipeline using both a large closed-weight model and a smaller open-weight model, with or without retrieval-augmented generation (RAG). The use of open-source models could allow institutions deploying ExplainIt to maintain control of their student data without substantially sacrificing performance. We find that while there are small statistically significant differences in performance between the RAG conditions of each LLM, they are nearly comparable at this task. Additionally, the LLM-generated knowledge components are of higher quality when relevant course material is provided for RAG, although consistency is not improved. These results indicate that both large closed-weight and smaller open-weight models show promise in this task, but fine-tuning may be necessary to improve performance further.
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Longwei Cong | Sonja Hahn | Sebastian Gombert | Leon Camus | Hendrik Drachsler | Ulf Kroehne
Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen’s kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the partially_correct_incomplete label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
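The idea of modelling grading correctness as a function of grader ability and response difficulty can be sketched with the simplest IRT model, the one-parameter logistic (Rasch) model. The abstract does not state which IRT variant the authors use, so the choice of model, the function name `p_correct`, and the parameter values below are all assumptions for illustration.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (1PL) probability that a grader labels a response correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical grader abilities and response difficulties on a logit scale.
strong_grader, weak_grader = 1.5, 0.0
easy_response, hard_response = -1.0, 2.0

strong_on_easy = p_correct(strong_grader, easy_response)
strong_on_hard = p_correct(strong_grader, hard_response)
weak_on_hard = p_correct(weak_grader, hard_response)
```

Under this model, the probability of a correct grade is 0.5 when ability equals difficulty and falls as difficulty rises past ability. That is the kind of response-level decline the abstract describes comparing across models.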
Using k-Shot Prompting with Large k for the Automated Scoring of a German Written Elicited Imitation Test
Malte Sternik | Ronja Laarmann-Quante | Anastasia Drackert
This paper explores the application of a Large Language Model (LLM) using k-shot prompting with large k for automatically scoring a German Written Elicited Imitation Test (WEIT), a test for assessing literacy-dependent procedural knowledge in German as a foreign language. In this test, test-takers are briefly presented with written sentences which they then have to reproduce in writing as accurately as possible. The responses are scored on an ordinal scale which differentiates between different types of errors (e.g. lexical vs. grammatical). We find that with increasing k (in a range from 1 to 700) accuracy increases significantly but it also depends on the drawn sample and varies across different runs of the same prompt. Overall, the k-shot setting which relies on in-context learning without being provided with the scoring rubric outperforms a baseline where only the scoring rubric is provided to the model. However, the LLM does not outperform previous results based on rule-based or BERT-based models.
Kelvi: A Morphological Parser to Support Tamil Literacy
Shankhalika Srikanth | Sabrina Yu | Sophia Chan | Madeline Solis de Ovando
We discuss the development of kelvi.ca, an open source web-based dictionary and morphological parser designed to aid Tamil learners in developing their literacy skills. Tamil is an agglutinative language and heavily suffixal. Existing Tamil dictionaries only carry stems, not conjugated or inflected forms, and for a beginner learner of the language, isolating the stem in an unfamiliar word can be very challenging. Kelvi provides 1) the stem of any input word alongside its definition, and 2) non-technical descriptions of any suffixes that are part of this input, so that learners will gradually start to recognize these suffixes and be able to understand and produce new Tamil words themselves. In detailing our process of collaborative research, user interviews, suffix database creation, and error analysis, we also hope to show that Kelvi can be adapted for other languages and has the potential to be a useful pedagogical aid for learner literacy development, especially for agglutinative and/or polysynthetic languages which tend to be otherwise underserved in the mainstream.
From Questions to Assessment Tuples: A Multi-Agent Framework with Bloom-Specialized Agents and Automated Verification
Gee-Lyle Wong | Runcong Zhao | Yulan He | Jiazheng Li
Automatic question generation with large language models has advanced rapidly, yet producing assessment-ready items, complete with mark schemes and expected answers, remains challenging, especially when generation must reliably target higher-order cognitive levels in Bloom’s Taxonomy. We propose a multi-agent, multi-stage framework that generates structured assessment tuples for both short-answer questions (SAQs) and scenario-based questions (SBQs), combining Bloom-specialized generation agents with staged decomposition and automated verification. We further introduce a rubric-guided LLM-as-a-judge evaluation framework with Bloom-specific alignment metrics. Experiments on university-level AI course material across five generation pipelines show that prompt-level Bloom conditioning alone is insufficient to reliably achieve cognitive control. In contrast, our structured approach yields consistent and notable improvements in alignment, mark scheme quality, and output yield, particularly for higher-order Bloom levels over baseline pipelines.
Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training
Seongjin Park
Pronunciation feedback in language learning depends on accurate detection of learner errors, but it is unclear whether modern ASR systems are suitable for this purpose. Their language models recover intended words rather than what was actually pronounced, systematically masking mispronunciations. This is a tendency we call intent bias. By evaluating eight ASR systems spanning three architectures on two L2 English corpora, we find that overcorrection rate correlates inversely with word error rate. In other words, ASR systems with lower WER tend to mask more pronunciation errors. We propose surface-faithful reranking, an inference-time method that uses phoneme-level acoustic similarity to select N-best hypotheses closer to what the learner actually said. Without retraining or access to model internals, the method reduces the false acceptance rate of mispronunciations by 6.0 percentage points on L2-ARCTIC and 5.6 on speechocean762. The improvement is consistent across age groups and first-language backgrounds, though substantial overcorrection remains, pointing to the need for pronunciation-aware ASR objectives.
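The reranking idea in this abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each N-best hypothesis comes with a canonical phoneme sequence and that a separate phoneme recognizer has produced the phonemes the learner actually realized. Plain Levenshtein distance stands in for the phoneme-level acoustic similarity, which the abstract does not specify, and the names `edit_distance` and `rerank_surface_faithful` are illustrative rather than the author's API.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rerank_surface_faithful(nbest, heard):
    """Return the hypothesis whose phonemes are closest to what was heard.

    nbest: list of (text, phoneme list) pairs, in the ASR's original order.
    heard: phonemes from a (hypothetical) phoneme recognizer.
    Ties keep the ASR's ranking because min() returns the first minimum.
    """
    return min(nbest, key=lambda hyp: edit_distance(hyp[1], heard))


# Toy example: the learner said "sink" where "think" was intended.
nbest = [
    ("think", ["TH", "IH", "NG", "K"]),  # intended word, ranked first by the ASR
    ("sink", ["S", "IH", "NG", "K"]),    # what was actually pronounced
]
heard = ["S", "IH", "NG", "K"]
best_text, _ = rerank_surface_faithful(nbest, heard)
```

In this toy case the reranker prefers the hypothesis that matches the realized phonemes over the ASR's top-ranked intended word, which is the behavior a mispronunciation detector needs.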
KEYSCORE — Keystroke-enhanced Automated Essay Scoring
Nils-Jonathan Schaller | Daniel Mora Melanchthon | Thorben Jansen | Olaf Köller | Andrea Horbach
We investigate the predictive power of keystroke logging data for automated essay scoring using the newly collected PISA FLA writing process dataset. Based on 3,882 writing sessions, we extract a comprehensive set of keystroke-based process features, including temporal measures, pause and burst patterns, deletion behavior, production efficiency, and navigation activity, and evaluate their ability to predict holistic essay scores on a 0–5 scale. We specifically compare process-feature-based models with content-based scoring approaches trained on data written with and without the help of an AI chatbot, and investigate how predictive power evolves over the course of a writing session by training models at multiple time thresholds. Our analysis reveals that keystroke features provide genuine early predictive signal, capturing aspects of writing fluency and revision behavior that distinguish writers before their texts are long enough to score conventionally. Additionally, our results suggest that process-based scoring is a viable complement to product-based approaches, with promise for formative, real-time feedback during writing.
EduMUSE: A Multimodal Educational Dataset with Automatically Extracted Instructional Context
Andreea Dutulescu | Stefan Ruseti | Mihai Dascalu | Danielle McNamara
Research in AI applied to education increasingly relies on large-scale, high-quality datasets to support the development and evaluation of learning analytics and intelligent educational systems. Open educational resources provide a promising foundation, yet few datasets integrate structured instructional content with assessment materials in a multimodal form. In this study, we introduce a large-scale multimodal educational dataset (EduMUSE - Educational Multimodal Understanding & Solution Dataset) constructed from OpenStax undergraduate textbooks across multiple domains. The dataset integrates hierarchically structured instructional text, figures, exercises, and, when available, official solutions. For exercises with solutions, we introduce an automatic method that associates each exercise with a focused instructional subsection rather than entire textbook chapters, estimating subsection relevance via solution likelihood under candidate contexts using a vision–language model. We analyze the impact of contextualization on the behavior of vision–language models across different contexts. Results indicate that subsection-level instructional context has a measurable impact on model performance, with variation across model scales and task formulations. The dataset and code are released as open source at https://github.com/upb-nlp/BEA-EduMUSE/ to support reproducible research in multimodal educational modeling and to facilitate generating similar datasets using our approach.
Opportunities and Challenges of LLMs in Education: An NLP Perspective
Sowmya Vajjala | Bashar Alhafni | Stefano Banno | Kaushal Maurya | Ekaterina Kochmar
Fine-Grained Content Zone Prediction in German Argumentative Essays Using LLMs
Xiaoyu Bai | Manfred Stede
We introduce FDE-Arg, a newly compiled dataset of argumentative student essays in German. We use two Llama models of different sizes to label sentence-level content zones both in FDE-Arg and in an existing dataset of source-dependent argumentative essays. We investigate three approaches for improving model performance: a) incorporating targeted task information into the prompt text; b) few-shot prompting with up to 10 examples selected on the basis of similarity with the target instance; and c) parameter-efficient fine-tuning. We observe that both incorporating additional information in the prompts and similarity-based few-shot prompting produce highly promising performance gains over the baseline.
Multi-step Large Language Model for Fine-Grained Feedback in Stepwise Linear Equation Solutions
Imran Chamieh | Torsten Zesch | Klaus Giebermann
This paper addresses the problem of fine-grained error classification in stepwise algebraic problem solving, with the objective of enabling accurate and timely feedback in large-scale educational environments. Using authentic student response data, we compare a carefully engineered rule-based baseline with large language models (LLMs) in zero-shot and few-shot configurations, as well as multistep LLM-based approaches. We further consider hybrid architectures that combine symbolic computation with LLM inferential processes, with particular emphasis on enhancing the robustness and faithfulness of intermediate representations and mitigating error propagation across successive stages of the computational pipeline. Our empirical results indicate that, although the baseline model delivers strong and reliable performance for narrowly defined error categories, structured multi-step approaches improve performance relative to single-step methods by achieving superior precision, F1 scores, and overall accuracy.
Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Abigail Gurin Schleifer | Moriah Ariely | Beata Beigman Klebanov | Asaf Salman | Giora Alexandron
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs’ broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human–human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
Using LLMs for item creation: Validating the potential of automatically generated sentence repetition test items for language assessment
Sarah Löber | Björn Rudzewitz | Yuan Chu | Mengyuan He | Shiqin Liu | Yushan Ye | Xiaobin Chen
Various aspects of the Elicited Imitation Test (EIT), a sentence repetition task for language assessment, can be automated, for example in terms of test administration or automatic scoring. It is potentially also possible to generate test items with Large Language Models (LLMs). This study investigates the potential of GPT-4o for item creation in the context of EIT, creating a parallel form to two popular and validated tests. We analysed the tests in terms of their linguistic and psychometric properties. While the items created by the LLM show some difference in grammatical structures when compared to human-written items, linguistic complexity results did not differ significantly between tests. Psychometric properties showed only minor differences. These findings lend support to the potential of Automatic Item Generation with LLMs in the context of sentence repetition tasks and might support the process of standardisation in SLA research and testing by enabling parallel test creation.
Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment
Yiyun Zhou | Francis O’Donnell | Victoria Yaneva
Answer explanations for medical multiple-choice questions (MCQs) are a valuable learning tool, but producing them is resource intensive. Writing high quality explanations requires specialized medical expertise and careful alignment with the keyed answer, distractors, and the clinical vignette. This paper evaluates whether a template-aware, retrieval-guided large language model (LLM) workflow can support this production task in a real formative assessment setting. Using a 50-item medical education self-assessment, we compared AI-generated and expert-written MCQ explanations in a blinded study involving eight medical faculty and sixteen medical students. Each participant rated 25 of 50 paired explanations on clarity, amount of information, and structure. The clearest empirical difference was in amount of information: AI-generated explanations were rated significantly higher than expert-written explanations in a cumulative link mixed model analysis (OR = 1.99, 95% CI [1.33, 2.99], p = 0.001). Ratings of clarity and structure did not differ significantly between conditions. Based on faculty ratings, a smaller proportion of AI-generated explanations were judged to require correction (20%) compared with expert-written explanations (38%). These findings suggest that AI can reduce first-draft authoring effort in explanation writing while still requiring expert review to ensure content accuracy.
What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction
Anna Smirnova | Artyom Kopan | Vladislav Makeev | George Chernishev
Russian grammar correction models can improve on aggregate benchmarks while getting worse at specific grammar rules. We show this through per-rule evaluation on a diagnostic benchmark of 48 prescriptive rules: finetuning on synthetic data improves overall F0.5 while driving subordinate-clause comma accuracy from 14% to 1%. The suppression is invisible under corpus-level metrics and undetectable with existing coarse, corpus-specific tagsets; it is recoverable only when diagnosed at rule granularity. To enable this analysis, we develop a 98-category error taxonomy grounded in Rozental’s reference grammar and SyntErr, an open-source synthetic data generator whose per-rule distribution is an explicit parameter, designed to support arbitrary rule sets and languages. Finetuning eight open models (0.8B–12B) on 39K synthetic examples yields up to 75.3 F0.5, approaching frontier API models with models small enough to run on device. We release the taxonomy, generator, per-rule evaluation data, and all training artifacts.
FinnGEC: Benchmarking Grammatical Error Correction for Finnish
Anh-Duc Vu | Mikhail Zolotilin | Jue Hou | Anisia Katinskaia | Yiheng Wu | Roman Yangarber
Grammatical error correction (GEC) is a natural language processing task critical for improving language quality, supporting communication efficacy, and for language learning and teaching. To date, most research in GEC has focused on major, resource-rich languages such as English, while lower-resource languages remain underexplored. In this paper, we focus on GEC for Finnish. We build a dataset based on data from real-world language learners. We explore various approaches to GEC, including fine-tuning transformer models and zero-shot LLM prompting. We also adapt ERRANT, a popular GEC evaluation tool, for the Finnish language, to evaluate the performance of the models. Our results indicate that the performance of GEC for Finnish is promising, but requires further research. To the best of our knowledge, this is the first in-depth exploration of GEC for Finnish; we provide benchmarks, datasets, and code for GEC for Finnish—by releasing our training and test data and the code for Finnish ERRANT—to support further research on this important task.
From Metrics to Meaning: Rule-Grounded LLM Explanations for Data Literacy in the Case of Youth Football
Tomasz Piłka | Tomasz Kuczyński | Mateusz Czajka
Young athletes, parents, and coaches are increasingly exposed to training metrics from wearable technology, yet such metrics are difficult to interpret without contextual explanation. We present a rule-grounded data-to-text framework for supporting data literacy in youth football through concise, stakeholder-specific summaries of training sessions. A rule layer maps duration-normalised indicators to structured facts about session profile, internal intensity, speed exposure, and movement dynamics, which are then verbalised by a large language model for coaches, parents, or players. We compare direct generation from raw metrics, generation from rule-derived facts, and an augmented rule-grounded configuration, ENRICHED, that supplements validated facts with raw metrics and explicit threshold definitions. In this setting, selected open-weight models are additionally adapted using LoRA. The framework is developed using 122 anonymised player-session records from a U15 environment and evaluated on a held-out subset of ten sessions with stakeholder-oriented reference summaries. The results indicate that rule grounding improves reliability and audience adaptation compared with direct generation from raw metrics, particularly by reducing unsupported or overly strong interpretations. A school-based expert evaluation with physical education teachers further suggests that player-facing explanations in the evaluated ENRICHED setting can remain accurate, comprehensible, and practically useful. We position the framework as an interpretable data-literacy support interface for youth sport analytics.
Sharing is Caring: Advantages of Sharing a Language Background with Learners as an Annotator of Learner Data in UD
Caroline Grand-Clement | Arianna Masciolini
This paper looks at the impact of annotators sharing a language background with learners when annotating learner data using the Universal Dependencies (UD) framework. We perform a study comparing annotations by two different annotators working on sets of L2 Swedish sentences (learner sentences and target corrections) from the Swedish Learner Language corpus (SweLL) written by learners for whom French is a main writing language. The annotators are both L2 speakers of Swedish but have different knowledge of French: one is a native French speaker and the other has no knowledge of French. We find high annotator agreement, which may indicate a non-significant impact, though we qualitatively observe an advantage in sharing language background.
Productive struggle is a critical component of mathematics education, requiring students to actively work through ideas rather than just making errors. However, identifying this struggle from text transcripts is challenging because students often mask confusion with epistemic hedging rather than direct statements. Zero-shot large language models exhibit a conservative bias, systematically under-detecting struggle in classroom discourse. We introduce a two-stage NLP pipeline comprising a lexical heuristic gate and an LLM subtype classifier. Our model achieves 90.0% binary accuracy and 84.0% 4-category accuracy. We demonstrate the pedagogical value of this tool by showing that struggle is uniquely concentrated during explicit mathematical reasoning, offering educators a scalable method for root-cause analysis.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
Ravi Kumar | Utkarsh Grover | Xiaomin Lin | Agoritsa Polyzou
Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM’s style with a specific instructor’s tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation to align with a target instructor’s style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor’s grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal-based policy optimization, while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning. It updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while preserving perfect correctness; for example, on APPS it boosts Style Alignment Score (SAC) to 96.2% (from 34.8% for Base) with Correctness Accuracy (CA) up to 100% on both Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone, structure, and guidance).
Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation
Beata Beigman Klebanov | Andrew Hoang | Jamie Mikeska | Benny Longwill | Sanjna Kashyap | Shreyashi Halder | Aakanksha Bhatia
Recent advances in the capabilities of conversational agents based on large language models make them a very promising tool for role playing K-12 students in order to train educators in conversational teaching practices, such as eliciting student thinking, explaining disciplinary content, and facilitating a classroom discussion. In fact, such simulations can be, and have been, developed relatively quickly and without data to machine-learn from – neither classroom data nor human-simulated data. To enhance the usefulness and effectiveness of such teaching simulations, it is necessary to provide pedagogically sound, timely, and personalized feedback to the educator about their simulation performance. In this study, we present experiments on fine-tuning models to evaluate educator performance in an elicitation teaching simulation. The models are developed with data collected during usability testing of the simulation and evaluated on real user data. We show that even with relatively little fine-tuning data, robust performance can be obtained.
Multi-component student writing profiles for expert-aligned automated evaluation of English learner essays.
Russell Moore | Andrew Caines | Paula Buttery
Automated Writing Evaluation (AWE) platforms have become common, but a significant gap remains between automated assessment and expert human feedback. We address this gap by introducing a supervised learning method that uses a multi-component student writing profile (comprising estimated CEFR levels, grammatical error rates, and vocabulary distribution) to align AI scoring with expert human judgements. In the context of an online essay-writing platform for second language learners of English, our model achieves a 36% reduction in RMSE for holistic essay scoring and an 84% improvement in similarity to human-expert annotation of grammatical errors compared to automarker scores (26% and 57% improvement over the best-performing comparable earlier work, by Zaidi et al. (2019)). Furthermore, we demonstrate that the model can predict a student’s final submission profile (CEFR level and grammatical error rate) from earlier drafts and that predictions generalise to a subsequent task, offering new possibilities for automated curriculum planning. Finally, we introduce a visualisation tool that provides educators with clear expert-aligned longitudinal views of student development.
Policy-Sensitive Fairness Evaluation in Automated Scoring of Clinical Communication
Saed Rezayi | Le An Ha | Victoria Yaneva | Polina Harik | Janet Mee | Jason Snyder
This study examines automated scoring fairness in a formative assessment context: the automated evaluation of medical students’ communication skills. Building on the premise that definitions of fairness are value-dependent, we investigate how conclusions about group differences may vary under different weighting schemes for false positives (FPs) and false negatives (FNs). Results show that when errors are treated symmetrically, no statistically significant differences are observed across demographic groups based on race or gender. This pattern remains stable when error weights are varied, with no consistent or robust disparities emerging. A small number of isolated differences appear under moderate FN weighting. Overall, the findings suggest that fairness conclusions in this setting are relatively robust to variations in error weighting. At the same time, the study highlights the importance of making value assumptions explicit when evaluating automated scoring systems, particularly in formative contexts where error trade-offs carry pedagogical implications for feedback, learner engagement, and educational equity.
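How error weighting can change a fairness comparison is easy to see in a small sketch. The abstract does not give the study's cost formulation, so the weighted error rate below, (w_fp * FP + w_fn * FN) / n, is an assumed one, and the function names and toy data are hypothetical. Significance testing, which the study relies on, is omitted.

```python
def weighted_error_rate(y_true, y_pred, w_fp=1.0, w_fn=1.0):
    """Weighted error per response: (w_fp * FP + w_fn * FN) / n."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (w_fp * fp + w_fn * fn) / len(y_true)


def group_gap(groups, w_fp=1.0, w_fn=1.0):
    """Largest difference in weighted error rate between any two groups."""
    rates = {name: weighted_error_rate(t, p, w_fp, w_fn)
             for name, (t, p) in groups.items()}
    return max(rates.values()) - min(rates.values()), rates


# Toy data: (human labels, automated labels) per demographic group.
groups = {
    "A": ([1, 1, 0, 0], [1, 0, 0, 0]),  # one false negative
    "B": ([1, 1, 0, 0], [1, 1, 1, 0]),  # one false positive
}
symmetric_gap, _ = group_gap(groups)                     # errors weighted equally
fn_heavy_gap, _ = group_gap(groups, w_fp=1.0, w_fn=3.0)  # FNs cost three times more
```

With symmetric weights the two toy groups look identical, while weighting false negatives more heavily opens a gap between them. This only illustrates why the weighting assumption has to be stated explicitly; it says nothing about the study's own data.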
Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
Haziq Khalid | Salsabeel Shapsough | Imran Zualkernan
Generating diverse, pedagogically valid stories for Arabic early-grade reading assessments requires balancing tight constraints on vocabulary, reading level, and narrative structure against the need to avoid repetitive plots that undermine assessment validity. We investigate noise steering, injecting calibrated Gaussian perturbations into the internal representations of transformer models at inference time, as a training-free diversity method evaluated across five small Arabic-centric language models (7–9B parameters). We compare four injection strategies against high-temperature sampling baselines, measuring diversity, quality, constraint adherence, and reading grade level. Residual stream noise consistently improves narrative diversity with minimal quality or constraint cost and preserves early-grade reading level across all Arabic-centric models. Attention entropy noise injection (AENI) stabilizes the otherwise unreliable attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models. We find internal representation-level perturbation to be a more suitable diversity strategy than output-level stochasticity for constrained educational content generation.
The Effects of Structured LLM-Generated Feedback on Programming Assignment Performance
Tsvetomila Mihaylova | Evanfiya Logacheva | Arto Hellas | Jing Fan | Francisco Castro | Bita Akram | Narges Norouzi | Peter Brusilovsky | Juho Leinonen
When programming students encounter errors in their code, compiler messages or static analysis output often provide limited guidance, particularly for novice programmers. Personalized feedback from instructors can be effective but does not scale well. Recent advances in large language models (LLMs) enable automated feedback generation at scale. This study examines whether LLM-generated feedback with different levels of guidance is associated with differences in students’ problem-solving behavior. We analyze effects on time to solution and number of attempts, and examine whether these effects differ by programming experience. We design three feedback types and compare them to a baseline in which students receive only compiler error messages. Results from an online programming course show that LLM-generated feedback is associated with faster time to solution compared to the no-feedback baseline, with less guided feedback showing slightly stronger effects. Overall, the findings suggest that feedback structure plays an important role in how students progress toward correct solutions and motivate further work on adaptive feedback designs and longer-term learning outcomes.
Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues
Shuyan Huang | Alexander Scarlatos | Jaewook Lee | Andrew Lan
Recent advances in large language models (LLMs) have led to the development of AI-powered tutoring systems that provide interactive support via dialogue. To enable these tutoring systems to provide personalized support, it is essential to assess student performance at each turn, motivating knowledge tracing (KT) in dialogue settings. However, existing dialogue-based KT approaches often ignore question difficulty and rely on opaque LLM latent representations, hindering accurate and interpretable prediction. In this work, we propose an interpretable difficulty-aware conversational KT framework that leverages LLMs to explicitly model student knowledge state and the difficulty of tutor-posed tasks at each dialogue turn. The framework incorporates the original question and the next tutor-posed task to estimate the student’s knowledge state and the difficulty of the upcoming turn. It further integrates Item Response Theory to map LLM outputs into student ability and question difficulty parameters, enabling interpretable prediction of student performance grounded in cognitive theories of learning. We evaluate the framework on two tutor-student dialogue datasets. Quantitative and qualitative results show that our framework outperforms existing KT baselines while generating interpretable outputs consistent with cognitive theory. Our code and data are available at https://github.com/umass-ml4ed/Difficulty-Aware-DialogKT.
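For readers unfamiliar with Item Response Theory, the sketch below shows the standard two-parameter logistic form that turns an ability estimate and a difficulty estimate into a predicted probability of success. The abstract does not say which IRT variant the framework uses or how LLM outputs are mapped to parameters, so the function `p_correct` and the per-turn values are placeholders for illustration only.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """2PL IRT: probability that the student answers the next task correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Placeholder per-turn estimates (in the paper these would come from LLM outputs).
turns = [
    {"ability": 0.0, "difficulty": 0.0},   # evenly matched
    {"ability": 1.0, "difficulty": -0.5},  # easy task for this student
    {"ability": -0.5, "difficulty": 1.0},  # hard task for this student
]
predictions = [p_correct(t["ability"], t["difficulty"]) for t in turns]
```

The prediction depends only on the gap between ability and difficulty, which is what makes the two parameters directly interpretable at each dialogue turn.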
Rubrics as Semantic Subspaces: A Unified Approach to Rubric-based Constructed Response Scoring across Short Answers and Essays
Sebastian Gombert | Sonja Hahn | Nico Andersen | Leon Camus | Zhifan Sun | Ngoc Nhu Hao Nguyen | Fabian Zehner | Longwei Cong | Alexander Mehler | Hendrik Drachsler
Rubrics are the primary reference for manual scoring of constructed responses, and there is growing interest in their use in automated scoring methodologies. In this work, we propose Aspect-Grounded Rubric–Answer Alignment (AGRAA), a rubric-based end-to-end scoring framework that models rubric descriptors as latent aspect spaces. Concretely, rubric descriptors are represented as low-dimensional subspaces derived from contextualised transformer embeddings, and student responses are scored according to how strongly their representations align with these rubric-induced spaces relative to the residual space outside them. This formulation provides a geometrically grounded interpretation of rubric-based scoring while enabling end-to-end training with standard transformer encoders. We introduce three distinct architectural variants and evaluate them on multiple short-answer and essay scoring datasets. Across these tasks, AGRAA achieves predictive performance highly competitive with strong neural and feature-based baselines. In addition, the framework yields interpretable intermediate representations that expose which rubric-defined aspects contribute to scoring decisions, enabling decision-aligned explanations grounded in rubric descriptors.
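The geometric idea of scoring a response by its alignment with a rubric-induced subspace relative to the residual space can be sketched in a few lines. This is not the AGRAA architecture, which is trained end-to-end on transformer embeddings; it assumes fixed, linearly independent descriptor vectors, uses toy three-dimensional embeddings, and the function name `subspace_alignment` is hypothetical.

```python
import numpy as np


def subspace_alignment(descriptor_vecs, response_vec):
    """Fraction of the response's squared norm lying in the descriptor subspace.

    descriptor_vecs: (k, d) array of linearly independent descriptor embeddings.
    response_vec: (d,) response embedding.
    Returns a value in [0, 1]; 1 minus that value is the share in the residual space.
    """
    # Orthonormal basis for the span of the descriptor embeddings (columns of q).
    q, _ = np.linalg.qr(np.asarray(descriptor_vecs, dtype=float).T)
    x = np.asarray(response_vec, dtype=float)
    projected = q.T @ x  # coordinates of x inside the subspace
    return float(projected @ projected / (x @ x))


# Toy 3-d embeddings: the rubric subspace is the x-y plane.
rubric = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
inside = subspace_alignment(rubric, [2.0, -1.0, 0.0])  # lies in the plane
outside = subspace_alignment(rubric, [0.0, 0.0, 5.0])  # orthogonal to the plane
mixed = subspace_alignment(rubric, [1.0, 0.0, 1.0])    # half in, half out
```

A response fully inside the subspace scores near 1, one orthogonal to it scores near 0, and the ratio itself points to how much of the response a given rubric aspect accounts for.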
Domain-Adaptive Pre-training for Automated Short Answer Grading in Conceptual Physics: Reliability, Question-Level Analysis, and Error Reduction
Shirin Lade | Alistair Willis | Jonathan Nylk | Oli Howson
This paper investigates whether automated short answer grading can reliably support teachers when marking conceptual physics responses in settings with limited labelled data. Using free-text responses derived from Force Concept Inventory-style questions, the study shows that incorporating subject-specific knowledge improves grading consistency, particularly in early deployment scenarios. The system reduces grading errors and provides more reliable agreement with reference judgments, especially for more challenging questions. These results suggest that automated grading can assist teachers by supporting marking decisions and prioritising responses for review, while still requiring human oversight.
Measuring Optimal Challenge: Trajectory-Based Difficulty Alignment in Open-Ended Language Tutoring
Ziqi Shu | Shuman Wang | Michael Hardy
Conversational English as a Foreign Language (EFL) tutoring relies on dynamically generated exercises rather than fixed item banks, so traditional difficulty estimation cannot verify whether a task is appropriately calibrated to a learner. We propose a framework that measures difficulty alignment directly from observable interactional behavior, classifying each exercise into one of three states (Under-Challenged, Optimally Challenged, or Over-Challenged) based on turn-level sequences of student attempts, errors, confusion, and tutor scaffolding. Using 1,566 exercises from the Teacher-Student Chatroom Corpus, we validate the classification against human annotation (Cohen’s kappa = 0.79 at the state level) and show that a learner’s cumulative trajectory of these states predicts success on subsequent exercises. Aggregating these predictions into a within-session capability-shift proxy, we find that sessions with higher proportions of over-challenging exercises systematically yield lower estimated shifts, while optimally challenging interactions are significantly associated with greater improvement than under-challenging ones — patterns consistent with Krashen’s Input Hypothesis.
PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving
Murong Yue | Desmond Mcglone | Emily Slutz | Wenhan Lyu | Yixuan Zhang | Jennifer Suh | Ziyu Yao
Collaborative Problem Solving (CPS) is a core skill in education, where the process of peer interaction is highly important. However, existing educational dialogue datasets mostly focus on classroom instruction or tutoring (i.e., teacher/tutor-student interaction), while datasets centering on small-group, student-student interaction are limited. This leaves research with limited resources for studying how students interact, coordinate, and solve problems together in real educational settings. To address this, we introduce PeerMathDial, the first dataset of peer CPS dialogues collected from authentic middle school math classrooms. It contains 55 dialogues from 27 students, totaling 6,406 turns. To facilitate research on CPS discourse analysis, we further build a corpus-grounded dialogue act taxonomy assisted by LLMs. Using the dataset and the dialogue act taxonomy, we demonstrate the practical applications of PeerMathDial across three use cases. First, we track how dialogues evolve over time and measure the impact of teacher interventions. Second, we align dialogue actions with student surveys to reveal the connection between students’ traits (e.g., confidence, leadership) and their actual behaviors. Third, by evaluating LLMs on dialogue act prediction, we offer a first look at the potential of LLMs for student simulation in educational applications. Our dataset and source code will be released to the community.
Effects of Varying LLM Access on Essay Writing Behavior
Julia Christenson | Karin de Langis | Shirley Anugrah Hayati | Dongyeop Kang
Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.
Assessment of L2 speech global dimensions using large audio language models
Elsayed Issa | Mahmoud Ali
Large audio language models (LALMs) integrate audio representations with large language models to enable unified understanding of spoken content. Their capabilities have been increasingly investigated across several benchmarks; however, the examination of their use in rating L2 speech is still in its infancy. This study explores the abilities of LALMs in scoring three L2 speech global dimensions: foreign accentedness, comprehensibility, and intelligibility. Ninety audio samples produced by L2 speakers were rated by ten native speaker raters as well as five LALMs. Model performance was evaluated against the human composite mean using Pearson's r, Spearman's ρ, mean absolute error (MAE), and systematic bias, with the human leave-one-out correlation (r = .46-.73 across dimensions) serving as an empirical performance benchmark. The results showed that no LALM reached human-level performance on any dimension. Only one model (i.e., Gemini) achieved a significant correlation with human ratings on comprehensibility (r = .28, p < .01), while Qwen2-Audio showed modest correlation on intelligibility (r = .32, p < .01). MAE ranged from 0.75 to 3.99 for accentedness (human: 1.24), 1.35 to 3.00 for comprehensibility (human: 1.24), and 12.03 to 15.43 for intelligibility (human: 8.49). All models exhibited systematic biases, with deviations ranging from -9.31 to +13.19 points. The paper concludes with a discussion of the implications for automated L2 speech assessment.
Incentives Of EdTech: A Systematic Review Of EduNLP Research
Gabrielle Gaudeau | Aoife O’Driscoll | Jasper Degraeuwe | Andrew Caines | Donya Rooein | Zeerak Talat
While the Natural Language Processing community has dedicated significant resources to developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics’ Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.
Children’s English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
Qian Shen | Fanghua Cao | Min Yao | Shlok Gilda | Bonnie Dorr | Walter Leite
Large Language Models (LLMs) are widely applied in educational practices, such as for generating children’s stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children’s reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children’s English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories with children’s interests, controllable difficulty and safety.
Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment
Sarh Alzu’Bi | Robert Reynolds
Arabic readability assessment is under-explored compared to English, and existing models are typically evaluated only within the training domain. We introduce the Jordanian School Textbook Corpus (JSTC), 82,512 segments from 240 textbooks spanning grades 1–12, and combine it with DARES to train XGBoost classifiers, fine-tuned CAMeLBERT transformers, and hybrid architectures evaluated both in-domain and on the BAREC out-of-domain benchmark. CAMeLBERT achieves strong in-domain performance (QWK = 0.830) but its cross-domain QWK collapses to 0.085, while XGBoost over 127 handcrafted linguistic features alone maintains the highest cross-domain QWK (0.240); adding [CLS] embeddings to those features actively harms transfer. Probing reveals that CAMeLBERT layers implicitly capture some linguistic features but higher-level signals overwhelm them, and Captum attribution identifies nouns and nominal particles such as al- as the most important tokens. The results argue for prioritizing linguistically-grounded features over contextual embeddings when cross-domain robustness is required.
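For readers unfamiliar with the metric reported above, quadratic weighted kappa (QWK) can be computed as in the following minimal sketch. It assumes integer labels in the range 0 to `num_labels - 1`; the function name and the toy labels are illustrative and are not taken from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_labels):
    """QWK for two equal-length lists of integer labels in [0, num_labels)."""
    n = len(y_true)
    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_labels))
                 for j in range(num_labels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) when both raters use one single label.
    return 1.0 - weighted_observed / weighted_expected

print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 1], num_labels=3))
```

Because disagreements are penalised by squared distance between levels, a prediction one grade off costs far less than one several grades off, which is why QWK suits ordinal readability levels.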
Predicting Item Difficulty and Generating Reading Comprehension Items via an Annotated Repository
Radhika Kapoor | Mayank Sharma | Sang Truong | Nick Haber | Ben Domingue | Maria Ruiz-Primo
Prediction of item difficulty from its text content is of substantial interest for automated generation of test items. In this paper, we focus on the related problem of recovering IRT-based difficulty when the data originally reported item p-value (percent correct responses). We model this item difficulty using a repository of reading passages and student data from US standardized tests from New York and Texas for grades 3-8 spanning the years 2018-23. This repository is annotated with meta-data on (1) linguistic features of the reading items, (2) test features of the passage, and (3) context features. Using a penalized regression model, we achieve an RMSE of 0.59 (compared to a 0.92 baseline) and a 0.77 correlation between true and predicted difficulty. We further evaluated the impact of LLM embeddings (ModernBERT, BERT, and LLaMA), finding that they marginally improve performance but function effectively as standalone alternatives to traditional linguistic features. Finally, we demonstrate how this difficulty prediction model powers a publicly available, human-in-the-loop tool for generating reading comprehension items.
Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Grandee Lee | Yue Wang | Che Yee Lye | Luke Peh
When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM’s scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
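The step from r = 0.698 to "roughly half the intended variance" is the usual squared-correlation reading, which can be checked as follows. This is a sketch: `pearson_r` and `variance_explained` are illustrative helpers and not part of the GEA procedure described by the authors.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def variance_explained(r):
    """Share of variance accounted for by a linear relation with correlation r."""
    return r * r

print(round(variance_explained(0.698), 3))  # 0.487
```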
Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr | Marie-Amelie Lawn | Kevin Gao | Inigo Serjeant | Stephen Heslip
Large language models can generate feedback on free-form student writing, but it is unclear whether such feedback is correct and pedagogically useful. We evaluate LLM-generated feedback on 65 undergraduate proof-writing exercises using Hattie and Timperley’s feedback framework and a grade agreement metric, comparing two models (GPT-4.1, GPT-5) under two workflow configurations graded by two independent LLM evaluators. A mark-scheme-augmented workflow improves grade correlation with human experts for both models, and its precomputed mark schemes allow instructors to audit the system before deployment. GPT-5 produces higher-quality feedback across all dimensions. The metrics we collect give some evidence that in the setting studied, feedback quality is high, and several sanity checks on our experiments support this finding. However, providing meaningful self-regulation support and controlled tests with students remain to be done. The results in this contribution show that feedback theory provides a useful lens for evaluating automated mathematical feedback.
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Tahreem Yasir | Wenbo Li | Sam Gilson | Sutapa Tithi | Xiaoyi Tian | Tiffany Barnes
Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution–feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education
Mragisha Jain | Tirth Bhatt | Griffin Pitts | Aum Pandya | Peter Brusilovsky | Narges Norouzi | Arto Hellas | Juho Leinonen | Bita Akram
Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students’ algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE’s feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.
LLM-Powered but Rule-Grounded: Pedagogically Relevant Grammatical Error Characterization for Learner Model Construction
Soroosh Akef | Amália Mendes | P Rebuschat | Detmar Meurers
Grammatical error correction approaches rarely characterize the pedagogical value of corrected errors. We propose a framework that combines LLM-based second-language writing correction with a rule-based characterization module to identify pedagogically relevant, fine-grained grammatical properties in learner texts. The characterization module targets 252 European Portuguese properties which are categorized by the CEFR level at which they are taught according to an authoritative curriculum, and property accuracy is inferred from contrasts between the learner and corrected texts. We evaluate the framework extrinsically by training interpretable automatic proficiency assessment models on accuracy features extracted from characterized errors in a Portuguese learner corpus. Across different prompting strategies, we show that models trained on features derived from LLM-corrected texts perform similarly to those trained on features derived from annotator-corrected texts and comparably to models trained on linguistic complexity features. Feature importance overlap is likewise high, and similar predictive patterns are observed in both annotator-based and LLM-based models, further supporting the validity of the proposed framework.
Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation
Mariam Barakat | Ekaterina Kochmar
We present a modular pipeline for educational analogy generation, decomposed into four stages – source finding, sub-concept generation, explanation generation, and evaluation – grounded in Structure Mapping Theory. Evaluating 12 LLMs across six model families on SCAR and ParallelPARC, we find that sub-concept grounding substantially improves retrieval precision and explanation quality but offers limited benefit in open-ended generation. We further validate an LLM-as-a-judge methodology against human annotations, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with absolute scores. Our results highlight cross-stage interactions that isolated studies cannot capture, and position sub-concept grounding as a key driver of analogy quality.
Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction
Takumi Goto | Yusuke Sakai | Taro Watanabe
Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC datasets and LLM inference.
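The general idea of edit-level majority voting can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: edits are extracted with Python's `difflib` rather than a dedicated GEC alignment tool, tokens are whitespace-split, and an edit is kept only when it appears in a strict majority of candidates. The function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_edits(source, hypothesis):
    """Set of (start, end, replacement) token edits turning source into hypothesis."""
    matcher = SequenceMatcher(a=source, b=hypothesis, autojunk=False)
    return {
        (i1, i2, tuple(hypothesis[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }

def majority_vote(source, hypotheses):
    """Apply only the edits proposed by more than half of the candidates."""
    counts = Counter()
    for hypothesis in hypotheses:
        counts.update(extract_edits(source, hypothesis))
    threshold = len(hypotheses) / 2  # strict majority (an assumed voting rule)
    kept = sorted((e for e, c in counts.items() if c > threshold), reverse=True)
    output = list(source)
    # Apply right to left so earlier token offsets stay valid.
    for start, end, replacement in kept:
        output[start:end] = replacement
    return output

source = "He go to school yesterday".split()
candidates = [
    "He went to school yesterday".split(),
    "He went to school yesterday".split(),
    "He went to the school yesterday".split(),  # contains an unnecessary edit
]
print(" ".join(majority_vote(source, candidates)))
```

Here the replacement of "go" with "went" is proposed by all three candidates and survives, while the inserted "the" appears in only one candidate and is discarded, which is how voting at the edit level can curb over-correction.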
Zero Shot Phonics: Evaluating Constraint-Adherent Phonics Story Generation in Large Language Models
Maria Monica Manlises | Ethel Ong
Phonics stories are essential for early literacy, requiring controlled repetition of grapheme-phoneme (GP) patterns while maintaining simplicity, suitability, and quality. Generating such texts poses a challenge for large language models (LLMs), which must balance multiple phonological and pedagogical constraints. We evaluate six LLMs in a zero-shot setting across 16 prompt configurations, producing 8,688 outputs and 39,096 stories. Outputs are assessed using a multi-dimensional framework covering phonological alignment, developmental lexical appropriateness, readability, and narrative quality. Results show that while LLMs generate highly readable and age-appropriate text, they exhibit variability in phoneme control and narrative coherence. Prompt design significantly affects performance, revealing trade-offs between focusing on multiple phonological, linguistic, and pedagogical constraints, while model choice also leads to significant differences. These findings highlight the challenges of controllable educational text generation and the importance of prompt design in balancing instructional objectives. We release our prompts, generated stories, and evaluation framework to support future work in phonics-based story generation for early readers.
From Dialogue to Learner Modeling: Identifying Candidate Signals of Productive Use in LLM-Based Grammar Practice
Luisa Ribeiro-Flucht | Lanhua Huang | Xiaobin Chen
Adaptive language-learning systems often model progress through correctness in constrained exercises, where the target response is predefined. In dialogue-based tutors, by contrast, learners can respond appropriately in many ways, making evidence of progress harder to interpret. This raises a learner-modeling problem: determining whether learner production provides useful evidence of progress, which aspects are informative, and how they might support adaptation. We address this problem using pilot data from an LLM-based English grammar tutor, comprising 40 pre- and post-test tasks, treatment interactions, and 2,406 learner messages. We propose a coding scheme for learner production in dialogue and explore whether the resulting evidence types can support future adaptive decisions. Findings show that learner production in dialogue can support adaptive grammar practice: prior target use predicted short-term performance, while finer-grained evidence helped distinguish different levels of productive control. We discuss implications for adaptive grammar-based dialogue systems that use learner production to support communicative practice.
Evaluating Adaptive Personalization of Educational Readings with Simulated Learners
Ryan Woo | Anmol Rao | Aryan Keluskar | Yinong Chen
We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. Unlike typical intelligent tutoring systems that adapt questions or feedback, we treat reading as the primary intervention and use question answering only as an observation channel for Bayesian Knowledge Tracing (BKT). This enables controlled comparison of LLM-powered adaptive and non-adaptive reading policies before classroom deployment. The framework links open educational content to a shared ontology of learning objectives and knowledge components, which is used to generate aligned reading–assessment pairs targeting one objective at a time. Simulated learners update their knowledge through a comprehension-and-memory process that models encoding, integration with prior knowledge, and misconception revision. The learner model combines established theories of reading with constrained answer selection, ensuring responses are generated only from information the learner has plausibly retained. Together, these components provide an interpretable offline testbed for studying whether adaptive reading improves learning outcomes.
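As background on the observation model mentioned above, the textbook Bayesian Knowledge Tracing update for a single knowledge component looks like the sketch below. The parameter values are arbitrary placeholders, and the abstract does not say how the framework parameterises BKT, so this is the standard formulation rather than the authors' implementation.

```python
def bkt_update(p_known, correct, slip, guess, learn):
    """One standard BKT step: condition on the answer, then apply learning.

    p_known: prior probability that the learner has mastered the component
    slip:    probability of answering incorrectly despite mastery
    guess:   probability of answering correctly without mastery
    learn:   probability of acquiring mastery after this opportunity
    """
    if correct:
        evidence = p_known * (1 - slip)
        posterior = evidence / (evidence + (1 - p_known) * guess)
    else:
        evidence = p_known * slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - guess))
    return posterior + (1 - posterior) * learn

# Placeholder parameters, chosen only for illustration.
p = 0.5
for answer in [True, False, True]:
    p = bkt_update(p, answer, slip=0.1, guess=0.2, learn=0.15)
    print(round(p, 3))
```

A correct answer raises the mastery estimate and an incorrect one lowers it, with `slip` and `guess` controlling how much weight each observation receives.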
Toward Cross-Domain Automated Feedback: A Comparative Evaluation of Open-Source Models across Diverse Student Assessment Types
Muhammad Haseeb | Min Paing Hmue | Ahmad Imam Amjad | Maaz Amjad | Victor Sheng
Constructive, personalized, and timely feedback is essential to student learning. However, providing such feedback in large classes remains a major challenge. Large language models (LLMs) offer alternatives to support instructors by generating relevant feedback that highlights both student strengths and areas for improvement. Nevertheless, most existing LLM-based feedback systems rely on proprietary APIs and are often tailored to specific tasks, which limits their accessibility, flexibility, and applicability in resource-constrained educational settings. In this study, we investigate the potential of two open-source LLMs (DeepSeek R1 and Qwen 3.5) to support automated feedback generation across three disciplines (programming assignments, essays, and mathematics problems). We evaluate two prompting strategies (unified and multi-agent) across these domains and use human judgment criteria to assess feedback quality. Through this analysis, we examine the potential of open-source models as practical, scalable alternatives to proprietary API-based systems for developing freely accessible feedback-generation tools. Our results show that a single open-source model can generate useful feedback across diverse domains, though with varying effectiveness. DeepSeek R1 performs better on reasoning-intensive tasks such as mathematics, while Qwen 3.5 works best for holistic tasks such as writing, but both models struggle with programming tasks. We find that model architecture matters more than prompting strategy, and reasoning-optimized models excel in structured domains, while general-purpose models perform better on holistic tasks. Finally, we conclude that a multi-agent approach does not consistently guarantee improved results over a single unified LLM approach.
Findings of the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners
Mariano Felice | Lucy Skidmore
This paper reports findings from the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners across three L1s (Spanish, German and Mandarin). The task featured open and closed tracks, using data from the British Council’s Knowledge-based Vocabulary Lists (KVL). Submissions were received from 23 teams employing diverse modelling approaches, including transformers, Large Language Models, feature-based approaches and ensembles. Results were evaluated using RMSE, with winning systems significantly exceeding the baseline and establishing new state-of-the-art benchmarks. This paper offers an examination of the participating systems, performance across tracks and L1s, and the factors that can affect prediction accuracy.
SATLab at BEA 2026 Shared Task 1: Predicting the Difficulty of English Words for Three L1 Learners Using Primarily Psycholinguistic Features
Yves Bestgen
This paper presents SATLab’s participation in the BEA 2026 shared task on predicting the difficulty of English words for L2 learners. The proposed system uses features mainly derived from word frequency lists, lexical norms, and psychometric data, which are input into a gradient boosting decision tree model. It outperformed the Baseline system but performed significantly worse than the top-performing teams. Feature contributions to model performance are analysed using gain scores and Spearman rank correlations, and a brief analysis of the most significant errors is provided.
UGA Threshold at BEA 2026 Shared Task 1: Predicting Vocabulary Acquisition Difficulty with Hand-Crafted SLA-Based Features
Emma Dalbo
This paper describes a feature-based system submitted to the BEA 2026 Shared Task on Vocabulary Difficulty Prediction (closed track). The system models vocabulary difficulty for English learners using linguistically motivated features capturing frequency, cross-linguistic similarity, phonological and orthographic complexity, and semantic properties, supplemented by multilingual embeddings (reduced via PCA). Multiple regression models were evaluated using cross-validation, with final predictions generated from ensemble and single-model configurations per language. The system achieves competitive performance across all three L1 groups (German, Spanish, and Chinese), outperforming the XLM-RoBERTa baseline in seven of nine runs in terms of RMSE, with the strongest gains observed for Chinese and more modest improvements for Spanish. An ablation study further demonstrates that frequency and cross-linguistic similarity factors contribute most substantially to predictive performance, with effects varying across L1s. These findings highlight the role of interpretable linguistic features in modeling vocabulary difficulty in an L1-aware setting.
TeamXBC at BEA 2026 Shared Task 1: How AI (and I) won the shared task: Vibe and agentic coding solutions for practical machine learning problems
Xiaobin Chen
The paper describes how the author used AI coding agents and a technique called vibe coding to successfully tackle the BEA 2026 shared task on vocabulary difficulty prediction. Three sets of predictions (runs) were submitted to the competition, corresponding to three experiments the author ran by giving the coding agent different levels of agency: (1) a one-off solution fully planned and implemented by the AI, (2) an AI self-determined iterative process that ran for 24 hours, and (3) a collaborative human-in-the-loop process where solutions were discussed between the author and the AI. Competition results showed that the collaborative mode delivered the best performance, demonstrating that at the current stage domain expert input and decision making are important and necessary for vibe coding solutions to practical machine learning problems.
SAAKTH at BEA 2026 Shared Task 1: L1-Aware English Vocabulary Difficulty Prediction with Hybrid Transformer and Psycholinguistic Features
Karthik Mattu | Adit Dhall | Arshad Naguru | Shubh Sehgal | Thejas Gowda | Hakyung Sung
This paper presents team SAAKTH’s system for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction (Closed Track). We address the key challenge that English word difficulty is not fixed but varies with English learners’ native language. Our approach combines a fine-tuned XLM-RoBERTa-large encoder with handcrafted psycholinguistic features engineered separately for each L1 group. These features are integrated via a shallow multilayer perceptron and optimized separately per L1, with five-seed ensembling and XGBoost-based blending for stability. Our system achieves RMSEs of 0.997 (es), 1.002 (de), and 0.932 (cn) on the development set, improving 20–25% over the baseline. Results highlight the effectiveness of L1-aware modeling under limited data.
SurreyCTS at BEA 2026 Shared Task 1: Semantic Funnelling and Entropy-based Multilingual Lexical Difficulty Prediction
Georgina Willoughby | Jordan Painter | Diptesh Kanojia | Emily Wells | Constantin Orasan
We describe the SurreyCTS system for the BEA 2026 shared task on lexical difficulty prediction. Our approach combines multilingual transformer encoders (RemBERT and COMET) with engineered linguistic features including semantic funnelling, lexical similarity, attention-derived signals, and language-aware representations. A weighted ensemble of the five strongest systems placed fifth among open-track teams, outperforming the open-track baseline across all three learner L1 groups (Spanish, German, and Chinese).
EduNLP at BEA 2026 Shared Task 1: Multi-Model Ensemble with Feature-Augmented Transformers for Vocabulary Difficulty Prediction
Avinash Kumar Sharma
We describe our system submitted to the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners. Our approach combines handcrafted linguistic features with fine-tuned XLM-RoBERTa transformers in a multi-model ensemble, participating in both the closed and open tracks. Our system outperforms the baselines on both tracks across all three L1s, with best RMSEs of 1.058 (closed, CN) and 0.992 (open, CN). Post-hoc error analysis reveals that polysemous words in rare senses and nominalized -ing forms constitute the primary failure mode.
AIDA at BEA 2026 Shared Task 1: A Two-Stage Framework for L1-Aware Vocabulary Difficulty Prediction with Representation Diversity and Residual Calibration
Seok Hyeon Cho | JunHyeok Choi | Sangeun Ji | Sung Won Han
We study vocabulary difficulty prediction for second language (L2) learners, a key component for adaptive language learning and assessment. Existing approaches often treat difficulty as an intrinsic property of words or contexts, overlooking representation-dependent variation and learner-specific factors such as L1 transfer. We participate in the BEA 2026 Shared Task Closed Track using the Spanish (L1) subset of the KVL dataset. We propose a two-stage framework that decouples representation learning from learner-aware calibration. Stage 1 constructs diverse representations using multiple pretrained encoders with varied pooling and prediction strategies, capturing complementary aspects of lexical and contextual complexity. Stage 2 models systematic residual errors with psycholinguistic and cross-lingual features, enabling explicit correction of prediction biases. Experiments show that our method outperforms strong baselines, improving RMSE (1.257 -> 0.976) and correlation (0.765 -> 0.857). These results highlight the importance of jointly modeling representation diversity and learner-specific effects. Our system ranked 3rd in the official BEA 2026 Shared Task Closed Track.
Failure at BEA 2026 Shared Task 1: One Pipeline, Three L1s: A Unified Language-Agnostic System for Vocabulary Difficulty Prediction
Abid Hossain | Kamruzzaman Khan Alve
We present a unified, language-agnostic system for the BEA 2026 Shared Task on vocabulary difficulty prediction. The system uses a single training pipeline across Spanish, German, and Mandarin Chinese without any language-specific adaptation. Input features include serialized text fields and four scalar length-based features, processed using an XLM-RoBERTa encoder with attention-mask-weighted mean pooling. Hyperparameters are tuned with Optuna under reduced cross-validation, followed by full 5-fold training and checkpoint-based ensembling. Our approach improves over the official closed-track baseline across all three L1 conditions, demonstrating that a shared architecture and training strategy can yield consistent gains without language-specific engineering. Error analysis shows higher prediction error at difficulty extremes, suggesting a regression-to-the-mean tendency.
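The attention-mask-weighted mean pooling mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `masked_mean_pool` and the toy vectors are assumptions made for illustration, and plain Python lists stand in for encoder outputs.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]],
                     attention_mask: List[float]) -> List[float]:
    """Average token vectors, weighting each position by its mask value.

    Padding positions carry a mask of 0 and therefore do not contribute.
    (Hypothetical helper; names are illustrative, not from the paper.)
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask value is required per token vector")
    dim = len(token_vectors[0])
    weighted_sum = [0.0] * dim
    mask_total = 0.0
    for vector, mask in zip(token_vectors, attention_mask):
        mask_total += mask
        for i, value in enumerate(vector):
            weighted_sum[i] += mask * value
    if mask_total == 0:
        # No real tokens: return the zero vector rather than divide by zero.
        return weighted_sum
    return [s / mask_total for s in weighted_sum]


# Two real tokens followed by one padding token (mask 0).
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

In this toy input the padding vector is ignored, so the pooled vector is the mean of the first two token vectors only.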
BoostedCats at BEA 2026 Shared Task 1: What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty
Jonas Mayer Martins | Zhuojing Huang | Aaricia Herygers | Lisa Beinborn
What makes a word difficult to learn, and how does the difficulty depend on the learner’s native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word’s familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
uogal at BEA 2026 Shared Task 1: Ensemble of Multilingual Encoders with NMT Augmentation for L1-Aware Vocabulary Difficulty Prediction
Bernardo Stearns | John P. McCrae | Thomas Gaillat | Jefkine Kafunah
We submit a system for the closed track of the BEA 2026 shared task on L1-aware vocabulary difficulty prediction (Spanish, German, Mandarin Chinese). We compared three families of approaches: hand-crafted tabular features with tree-based regressors, fine-tuned multilingual encoders, and decoder-based artificial learner simulation using LoRA-tuned Pythia models, each evaluated with and without NMT-augmented English context. Our best system is an ensemble of four base and four NMT-augmented multilingual encoders combined through per-language stacking (Nelder-Mead and ElasticNet meta-learner), which placed 2nd in the closed track across all three languages. We also report a monotonic scaling study of the decoder-based artificial learner simulation pipeline.
Jinnie’s Lab at BEA 2026 Shared Task 1: Precalibration of Vocabulary Item Difficulty with Multilingual Transformers and Multi-Task Learning
Zhe Li | Pauline Aguinalde | Jinnie Shin
This paper describes our submission to the BEA 2026 shared task 1 on vocabulary item difficulty prediction in multilingual settings. We investigated whether transformer-based representations learned directly from item content can support the prediction of vocabulary item difficulty across different L1 groups. Our approach adopted a multilingual BERT-based architecture, specifically the mmBERT, with representation augmentation at both the layer and token levels, followed by a multi-task cascade learning that incorporates part-of-speech information as an auxiliary structural signal. Results showed that multi-task mmBERT consistently outperforms the shared-task XLM-RoBERTa baseline across languages, while gains from more complex aggregation are not uniform. The findings showed that strong multilingual representations provide a competitive foundation for vocabulary item difficulty prediction, while the benefits of additional architectural complexity depend on the language and training setting.
Glite at BEA 2026 Shared Task 1: Holistic Difficulty Models Dominate, Feature Engineering Closes the Gap in L1-Aware Vocabulary Difficulty Prediction
Vassili Philippov | Dmitrii Andreev | Pavel Katunin | Anton Nikolaev
This paper describes our submission to the BEA 2026 Shared Task on L1-Aware English Vocabulary Difficulty Prediction. We build per-L1 CatBoost regressors over 1,161 candidate linguistic, psycholinguistic, dictionary, and LLM-derived features drawn from 129 feature sets; out-of-fold predictions from fine-tuned encoder and decoder-LLM regression heads enter the model as additional features. Features are selected via Recursive Feature Elimination with nested cross-validation, producing compact per-L1 models of 29-150 features per run. For the closed track we introduce a per-feature-column compliance audit that classifies 57 of 129 feature sets as track-eligible under the organiser rulings, an audit that forced a rebuild of the selection and ensembling pipelines in the final week. We further show that decoder-LLM LoRA regression heads — LLaMA-3.1-8B being the single strongest model in our pool — provide the largest marginal gains in the open track, and that a simpler per-L1 CatBoost on RFE-selected features matches or exceeds Ridge-stacking ensembles over the same base models. Our systems ranked 1st in the closed track and 2nd in the open track on all three L1s (Spanish, German, Mandarin), reducing baseline RMSE by 29.9% in the closed track and 35.9% in the open track on average.
NLP-Explorers at BEA 2026 Shared Task 1: DeBERTa–CatBoost Weighted Ensemble Approach for L1-Specific Vocabulary Difficulty Prediction
Tayyab Latif | Asifa Bibi | Sabur Butt | Grigori Sidorov | Alexander Gelbukh
Vocabulary difficulty prediction aims to estimate how difficult a word is for a learner. This is an important problem because word difficulty is shaped not only by the word itself, but also by the learner’s background and the context in which the word appears. In this work, we predict continuous difficulty scores for English target words using learner-specific information. Our approach combines a fine-tuned DeBERTa v3 Large model with a CatBoost regressor trained on transformer-based embeddings. The final score is produced through weighted ensembling, where DeBERTa provides the main prediction and CatBoost adds a smaller complementary signal. Our final system achieved RMSE scores of 1.040 for Spanish, 0.992 for German, and 0.882 for Chinese. The results were also stable across multiple runs, showing that the model behaved consistently under small changes in ensemble weight. These findings show that a simple hybrid system can provide reliable performance for vocabulary difficulty prediction. They also suggest that combining strong contextual representations with a lightweight regression model is an effective way to model learner-sensitive word difficulty.
RETUYT-INCO at BEA 2026 Shared Task 1: Feature-Enriched mDeBERTa for Word Difficulty Prediction
Santiago Robaina | Aiala Rosá | Luis Chiruzzo
We describe the RETUYT-INCO participation in the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners, a regression task that predicts GLMM psychometric difficulty scores for English target words given an L1 cue (Spanish, German, and Mandarin). We submitted two systems to the closed track (which restricts participants to the provided shared-task data and standard NLP resources, excluding external corpora and large language models): a feature-engineered XGBoost regressor for all three L1s, and, for Spanish, a 3-seed ensemble of mdeberta-v3-base fine-tuned with the same handcrafted features prepended as input text tokens. Our best test result is 1.094 RMSE on Spanish (ensemble), a 13.0% reduction over the XLM-RoBERTa-base closed baseline. We highlight two findings. First, a LaBSE cross-lingual cosine between the L1 source word and the English target word is the largest single-feature addition in our incremental ablation, reducing average development-split (dev) RMSE by 0.091 on top of an already strong string/frequency/POS feature set. Second, feature-only XGBoost, with no neural fine-tuning and no GPU, already beats the XLM-RoBERTa-base closed-track development baseline on average across the three L1s (1.273 vs. 1.287 RMSE).
Token Titans at BEA 2026 Shared Task 1: Multilingual Lexical Complexity Prediction via Fine-Tuned XLM-RoBERTa with Ensemble Decoding
Anubhab Parashar | Sandeep Mathias
We describe our submission to the BEA 2026 Shared Task on Multilingual Lexical Complexity Prediction. The system fine-tunes XLM-RoBERTa Large separately for Spanish, German, and Chinese, feeding each instance as a flat concatenation of the source word, its sentential context, an English clue, and the English target word. Training uses z-score label normalization and two independent runs that differ in learning rate, scheduler, and random seed; a weighted ensemble of their predictions (0.6/0.4) consistently reduces variance on the validation set. On the official test set the system scores RMSE = 1.170 and Pearson = 0.812.
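As a rough illustration of the two generic techniques named in the abstract above, z-score label normalization and a 0.6/0.4 weighted ensemble, the following sketch uses only the Python standard library. All function names and the toy numbers are assumptions; the abstract does not say which run receives the larger weight or how the statistics were computed, so the choices here (population standard deviation, run A weighted 0.6) are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_zscore(labels: List[float]) -> Tuple[float, float]:
    """Return the mean and (population) standard deviation of the training labels."""
    return mean(labels), pstdev(labels)


def to_zscore(values: List[float], mu: float, sigma: float) -> List[float]:
    """Map raw difficulty scores to normalized training targets."""
    return [(v - mu) / sigma for v in values]


def from_zscore(values: List[float], mu: float, sigma: float) -> List[float]:
    """Map normalized model outputs back to the original score scale."""
    return [v * sigma + mu for v in values]


def weighted_ensemble(run_a: List[float], run_b: List[float],
                      weight_a: float = 0.6, weight_b: float = 0.4) -> List[float]:
    """Blend two runs' predictions item by item (weights assumed to sum to 1)."""
    return [weight_a * a + weight_b * b for a, b in zip(run_a, run_b)]


# Toy training labels; not shared-task data.
train_labels = [1.0, 2.0, 3.0]
mu, sigma = fit_zscore(train_labels)
normalized = to_zscore(train_labels, mu, sigma)
restored = from_zscore(normalized, mu, sigma)

# Hypothetical predictions from two independent runs, already on the raw scale.
blended = weighted_ensemble([1.0, 2.0], [2.0, 4.0])
```

Normalizing with statistics fitted on the training labels only, and inverting the transform before blending or scoring, keeps RMSE comparable to values reported on the original scale.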
TOEBM at BEA 2026 Shared Task 1: Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling
wicaksono M. | Joanito Lopo | Tsamarah Nugraha | Ahmad Adi | Muhamad Nurfajri
Lexical difficulty prediction is a fundamental problem in language learning and readability assessment, requiring models to estimate word difficulty across different first-language (L1) backgrounds. However, existing approaches rely on regression-only training with scalar supervision, which does not explicitly structure the representation space, limiting their ability to capture cross-lingual alignment and ordinal difficulty. To mitigate these issues, we propose Context-Aligned Contrastive Regression, which integrates Ridge regression ensemble with two complementary objectives, i.e., Cross-View Context and Ordinal Soft Contrastive Learning. Experiments on three L1 datasets show that (i) contrastive objectives improve cross-lingual representation alignment while preserving language-specific nuances, (ii) the learned representations capture the ordinal structure of lexical difficulty, and (iii) the ensemble effectively mitigates systematic biases of individual models, leading to more stable performance across difficulty levels.
Data Asgardians at BEA 2026 Shared Task 1: A Hybrid Transformer–Feature Ensemble for L1-Aware English Vocabulary Difficulty Prediction
Adrian Pineda | Sabur Butt | Héctor Ceballos Cancino
This paper presents our system for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners. The task requires predicting psychometrically calibrated GLMM difficulty scores for English vocabulary items across three learner first-language (L1) backgrounds: Spanish (ES), German (DE), and Mandarin Chinese (CN). Our approach studies how hand-crafted linguistic features can complement contextual multilingual transformer representations. We engineer 33 phonological, morphological, semantic, contextual, and cross-lingual features, and evaluate feature-only regressors, Solo transformer models, Hybrid transformer models, and prediction-level ensembling. Our official Closed Track submissions were generated with XLM-RoBERTa-large Solo and Hybrid models, which improved over the official baseline for all three L1 groups, achieving test RMSEs of 1.182 (ES), 1.117 (DE), and 1.006 (CN) with a mean of 1.103. We then conducted a post-submission refinement using mDeBERTa-v3-base components and a Ridge stacking ensemble, which further reduced test RMSE to 1.037 (ES), 0.997 (DE), and 0.913 (CN), with a mean of 0.982, a mean improvement of 0.121 over our best XLM-RoBERTa-large system.
UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction
Nouran Khallaf | Serge Sharoff
This paper describes UOL@IDEM’s closed-track submission to the BEA 2026 shared task on L1-aware vocabulary difficulty prediction. We model the task as regression and train separate systems for Spanish, German, and Mandarin Chinese. Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration for the easiest items, which are often overpredicted.
Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?
Adam Nohejl | Xuanxin Wu | Yusuke Ide | Maria Riera Machin | Yi-Ning Chang
We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council’s Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online.
Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German
Sebastian Gombert | Zhifan Sun | Fabian Zehner | Jannik Lossjew | Tobias Wyrwich | Berrit Czinczel | David Bednorz | Sascha Bernholt | Knut Neumann | Ute Harms | Aiso Heinze | Hendrik Drachsler
We present the BEA 2026 shared task on rubric-based short answer scoring for German. Rubric-based short answer scoring is a case of automatic short answer scoring (ASAS) that requires models to apply textual scoring rubrics to student answers as a basis for assigning scores. For the shared task, we introduced a novel German-language dataset from multiple STEM domains to provide a comprehensive benchmark for this problem. The dataset was designed to evaluate both performance and generalization (the latter, by distinguishing between seen and unseen questions), as well as coarse- and fine-grained scoring (2-way vs. 3-way). The systems submitted to the shared task cover a wide range of approaches, including fine-tuned large language models, prompt-based methods, human-AI collaboration strategies, or a combination of these. The results show that structured, task-adapted LLM systems achieved the strongest performance across all tracks. The winning system, IWM-DKM, combined LoRA fine-tuning of Qwen models with rubric-aware input structuring, including checklist-style reasoning, rubric reframing as decision trees, background knowledge injection, and ensemble voting. Other systems similarly relied on fine-tuned LLMs, retrieval-augmented prompting, encoder–LLM ensembles, or weighted aggregation strategies. Overall, the shared task results show that rubric-based scoring benefits most from systems that explicitly operationalise rubric semantics, while generalisation to unseen questions remains a central challenge.
Open-source LLMs with simple, zero-shot prompts are at best middling graders on the BEA 2026 Automated Grading Shared Task – blunt-edge models, in fact. However, they are good enough to support human graders and save them time. We demonstrate the application of a hybrid grading approach that first transparently defines the success criteria and then pairs a zero-shot LLM grader with human review. The hybrid approach outperforms the LLM grader on its own and has the added advantage of keeping the human in the loop.
ASLAN at BEA 2026 Shared Task 2: Voting Across Scoring Paradigms
Marie Bexte | Yuning Ding | Josef Ruppenhofer | Nils-Jonathan Schaller | Daniel Mora Melanchthon | Torsten Zesch | Andrea Horbach
This paper describes the ASLAN system contribution to the BEA 2026 Shared Task on rubric-based short answer scoring for German (Gombert et al., 2026). We investigate three complementary modeling paradigms: similarity-based scoring, instance-based classification, and rubric-prompted large language models (LLMs). For the unseen answers track, where test answers belong to prompts observed during training, we compare question-specific and generic scoring models as well as ensemble variants. For the unseen questions track, where models must generalize to previously unseen prompts, we primarily rely on zero-shot LLM-based scoring using the scoring rubrics. Our experiments show that similarity-based models outperform instance-based models and LLM-based models in the unseen answers setting. In addition, we find that ensemble methods improve robustness over individual models.
WSE Research at BEA 2026 Shared Task 2: Multi-Strategy Rubric-Based Short Answer Scoring for German
Jonas Gwozdz | Andreas Both
We describe the WSE Research system for the BEA 2026 Shared Task 2 on Rubric-based Short Answer Scoring for German. Our system combines rubric-conditioned prompting with TF-IDF exemplar retrieval, LoRA fine-tuning of open-source Qwen models, and prediction aggregation across complementary scorers. The central question is when prompt engineering, parameter-efficient adaptation, and aggregation each help for rubric-based grading. On the ALICE-LP-1.0 trial set, a fine-tuned Qwen2.5-32B reaches QWK 0.769, surpassing the strongest prompted commercial baseline (Gemini 3 Flash, 0.748). On the official test set, the system ranks second on three tracks and third on the remaining one. Overall, the results show that rubric-conditioned fine-tuning is a competitive and cost-effective alternative to commercial APIs for German short answer scoring, while aggregation helps on seen questions but larger single models generalize better to unseen rubrics.
AMATI at BEA 2026 Shared Task 2: Automatic Short Answer Grading with Inductive Logic Programming and a Large Language Model
Alistair Willis | Aisling Third
We discuss the AMATI submission to the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German. Our neuro-symbolic system uses a combination of symbolic rules, automatically learned with a form of Inductive Logic Programming, and the Mistral-large language model. We wanted to investigate whether the combination would improve overall grading performance, while using the automatically induced symbolic rules for explainability, and the LLM for robustness. We find that the combination of approaches resulted in improved overall performance for the 3-way task. However, including the symbolic rules did not improve upon Mistral’s performance in the 2-way test. This paper presents our approach to the unseen answers challenges. Our team finished 6th out of 9 in the 2-way challenge, and 5th out of 8 in the 3-way challenge. In the 3-way challenge, neither our symbolic system nor the use of Mistral alone would have placed higher than 6th of the 8 competitors, illustrating the improvement of the combined approach over either of the individual approaches.
IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring
Kate Belcher | Marius De Kuthy Meurers | Kordula De Kuthy | Detmar Meurers
In this paper, we present the IWM-DKM team submissions to the BEA 2026 Shared Task 2: Rubric-based Short Answer Scoring for German. We systematically explored how fine-tuned language models can be reliably employed for short answer scoring, for which three aspects turn out to be particularly beneficial: supplementing the fine-tuning process with generated domain expertise, restructured rubrics, and thinking traces. To increase the robustness of the scoring, we combine distinct approaches in an ensemble. Our best submissions finished in first place across all tracks, indicating promise for the further application of these strategies in automatic scoring.
RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German
Ignacio Sastre | Ignacio Remersaro | Facundo Díaz | Nicolás De Horta | Luis Chiruzzo | Aiala Rosá | Santiago Góngora
In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
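The QWK figures quoted above refer to quadratic weighted kappa, the standard agreement statistic for ordinal scores. The sketch below shows the textbook definition in plain Python; it is not the shared task's official scorer, and the function name, the integer label encoding (0 to `num_labels - 1`), and the toy labels are assumptions for illustration.

```python
from typing import List


def quadratic_weighted_kappa(gold: List[int], pred: List[int],
                             num_labels: int) -> float:
    """Textbook QWK: 1 - (weighted observed disagreement / weighted expected disagreement).

    Labels are assumed to be integers in [0, num_labels - 1], with num_labels >= 2.
    """
    n = len(gold)
    # Observed confusion matrix: rows are gold labels, columns are predictions.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_labels))
                 for j in range(num_labels)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independent gold and predicted marginals.
            denominator += weight * gold_hist[i] * pred_hist[j] / n
    if denominator == 0:
        # All items share one label on both sides: treat as perfect agreement.
        return 1.0
    return 1.0 - numerator / denominator


# Toy 3-way labels; not shared-task data.
perfect = quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], num_labels=3)
inverted = quadratic_weighted_kappa([0, 1], [1, 0], num_labels=2)
```

Because the penalty grows with the squared distance between labels, a 3-way prediction that is off by two levels costs four times as much as one that is off by one, which is why 2-way and 3-way QWK values are not directly comparable.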
SDPA at BEA 2026 Shared Task 2: Efficient LLM Fine-Tuning for Rubric-based Short Answer Scoring
Zhexiong Liu | Jing Zhang
Automated short-answer scoring (ASA) is an important yet challenging task in educational assessment as it aims to evaluate open-ended student responses against predefined scoring rubrics that are often interrelated. Although large language models (LLMs) have demonstrated impressive capabilities in text understanding and reasoning, their application to ASA has primarily focused on prompt-based inference, largely due to the limited availability of annotated data required for effective model training. In this work, we investigate parameter-efficient fine-tuning strategies for LLMs using ASA annotations in German. Our experiments show that fine-tuned LLMs consistently outperform both prompt-based and ensemble-based language models, suggesting domain-adaptive LLM fine-tuning is more effective than prompting alone for ASA.