Daniel Mora Melanchthon


2026

We present IPN, our system for Subtask 1 of the PARSEME 2.0 Shared Task, which targets the identification of multiword expressions (MWEs) in 17 languages. Overall, IPN outperformed a baseline model with far more parameters, yet a performance gap to the top-performing systems remains. To better understand these results, we investigate Qwen3-32B’s suitability for mono-, cross- and multilingual MWE identification. We also explore whether this model benefits from prepending automatically generated thinking data to the gold label during instruction-tuning. We find that target language data is vital for instruction-tuning. Prepending generated thinking data to a subset of the training data slightly improves performance for two out of three languages, but more detailed evaluation is required.
This paper describes the ASLAN system contribution to the BEA 2026 Shared Task on rubric-based short answer scoring for German (Gombert et al., 2026). We investigate three complementary modeling paradigms: similarity-based scoring, instance-based classification, and rubric-prompted large language models (LLMs). For the unseen answers track, where test answers belong to prompts observed during training, we compare question-specific and generic scoring models as well as ensemble variants. For the unseen questions track, where models must generalize to previously unseen prompts, we primarily rely on zero-shot LLM-based scoring using the scoring rubrics. Our experiments show that similarity-based models outperform instance-based models and LLM-based models in the unseen answers setting. In addition, we find that ensemble methods improve robustness over individual models.
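The similarity-based paradigm mentioned above can be illustrated with a minimal sketch: a new answer receives the score of its most similar already-scored answer. The abstract does not say which similarity measure or text representation ASLAN uses, so the Jaccard token overlap, the function names, and the toy German answers below are all assumptions made purely for illustration.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_score(answer, scored_references):
    """Return (label, similarity) of the most similar scored reference.

    scored_references: list of (reference_text, label) pairs, e.g.
    training answers for the same prompt. Hypothetical interface.
    """
    tokens = tokenize(answer)
    best_label, best_sim = None, -1.0
    for ref_text, label in scored_references:
        sim = jaccard(tokens, tokenize(ref_text))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Toy, invented examples (not from the shared task data).
references = [
    ("die pflanze braucht licht und wasser", 2),
    ("die pflanze braucht wasser", 1),
    ("keine ahnung", 0),
]
label, sim = similarity_score("pflanze braucht licht und wasser", references)
```

In practice a system of this kind would more plausibly compare dense sentence embeddings than raw token sets; the nearest-neighbor logic stays the same.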
We investigate the predictive power of keystroke logging data for automated essay scoring using the newly collected PISA FLA writing process dataset. Based on 3,882 writing sessions, we extract a comprehensive set of keystroke-based process features, including temporal measures, pause and burst patterns, deletion behavior, production efficiency, and navigation activity, and evaluate their ability to predict holistic essay scores on a 0–5 scale. We specifically compare process-feature-based models with content-based scoring approaches trained on data written with and without the help of an AI chatbot, and investigate how predictive power evolves over the course of a writing session by training models at multiple time thresholds. Our analysis reveals that keystroke features provide a genuine early predictive signal, capturing aspects of writing fluency and revision behavior that distinguish writers before their texts are long enough to score conventionally. Additionally, our results suggest that process-based scoring is a viable complement to product-based approaches, with promise for formative, real-time feedback during writing.
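As a rough illustration of what pause and burst features can look like, the sketch below derives a few of them from a list of key-press timestamps. The abstract does not define its features or thresholds, so the 2000 ms pause threshold, the function name, and the returned feature names are assumptions, not the authors' definitions.

```python
def pause_burst_features(timestamps_ms, pause_threshold_ms=2000):
    """Compute simple pause/burst statistics from key-press times.

    timestamps_ms: ascending key-press times in milliseconds.
    pause_threshold_ms: an inter-key interval at or above this value
    counts as a pause and ends the current burst (assumed value).
    """
    n = len(timestamps_ms)
    if n < 2:
        # Zero or one keystroke: no intervals, so no pauses can be measured.
        return {
            "n_pauses": 0,
            "mean_iki_ms": 0.0,
            "n_bursts": n,
            "mean_burst_len": float(n),
        }

    # Inter-key intervals between consecutive key presses.
    ikis = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]

    # Split the keystroke sequence into bursts at every pause.
    bursts = []
    current = 1
    for iki in ikis:
        if iki >= pause_threshold_ms:
            bursts.append(current)
            current = 1
        else:
            current += 1
    bursts.append(current)

    return {
        "n_pauses": sum(1 for iki in ikis if iki >= pause_threshold_ms),
        "mean_iki_ms": sum(ikis) / len(ikis),
        "n_bursts": len(bursts),
        "mean_burst_len": sum(bursts) / len(bursts),
    }


# Invented example: three quick keystrokes, a long pause, two more.
features = pause_burst_features([0, 200, 400, 3000, 3200])
```

To mimic the time-threshold analysis described above, the same function could be applied to only those timestamps that fall before a given cutoff in the session.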
Beyond performance, model transparency is a crucial factor in Automated Essay Scoring, yet current systems often lack explainability, limiting their pedagogical value and users’ trust. Existing explainability methods, such as gradient-based attribution or feature-importance approaches, either produce counterintuitive explanations or are too complex for classroom use. To address this limitation, we make use of fine-grained prediction at the sentence level as a way to enhance explainability. We propose ablation strategies to derive sentence-level pseudo scores from essay-level gold scores and use them to train sentence-level models. We evaluate their performance against essay-level baselines on two datasets (ASAP and MEWS), and compare their sentence-level output to a human baseline. Results indicate a trade-off between essay-level performance and sentence-level granularity. For the language quality trait, most sentence-level models achieve performance comparable to the essay-level baseline, whereas for content, the approach yields more positive results on prompts with shorter