Stefan Keller


2026

Beyond performance, model transparency is a crucial factor in Automated Essay Scoring, yet current systems often lack explainability, which limits their pedagogical value and users’ trust. Existing explainability methods, such as gradient-based attribution or feature-importance approaches, either produce counterintuitive explanations or are too complex for classroom use. To address this limitation, we use fine-grained, sentence-level prediction to enhance explainability. We propose ablation strategies that derive sentence-level pseudo scores from essay-level gold scores, and we use these pseudo scores to train sentence-level models. We evaluate these models against essay-level baselines on two datasets (ASAP and MEWS) and compare their sentence-level output to a human baseline. Results indicate a trade-off between essay-level performance and sentence-level granularity. For the language quality trait, most sentence-level models perform comparably to the essay-level baseline, whereas for content, the approach yields more positive results on prompts with shorter