Santiago Robaina

2026

RETUYT-INCO at BEA 2026 Shared Task 1: Feature-Enriched mDeBERTa for Word Difficulty Prediction
Santiago Robaina | Aiala Rosá | Luis Chiruzzo
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

We describe the RETUYT-INCO participation in the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners, a regression task that predicts GLMM psychometric difficulty scores for English target words given an L1 cue (Spanish, German, and Mandarin). We submitted two systems to the closed track (which restricts participants to the provided shared-task data and standard NLP resources, excluding external corpora and large language models): a feature-engineered XGBoost regressor for all three L1s, and, for Spanish, a 3-seed ensemble of mdeberta-v3-base fine-tuned with the same handcrafted features prepended as input text tokens. Our best test result is 1.094 RMSE on Spanish (ensemble), a 13.0% reduction over the XLM-RoBERTa-base closed baseline. We highlight two findings. First, a LaBSE cross-lingual cosine between the L1 source word and the English target word is the largest single-feature addition in our incremental ablation, reducing average development-split (dev) RMSE by 0.091 on top of an already strong string/frequency/POS feature set. Second, feature-only XGBoost, with no neural fine-tuning and no GPU, already beats the XLM-RoBERTa-base closed-track development baseline on average across the three L1s (1.273 vs. 1.287 RMSE).
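The two quantities the abstract relies on, a cosine similarity between an L1 word vector and an English word vector, and RMSE over predicted difficulty scores, can be sketched as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the three-dimensional vectors are toy stand-ins rather than real LaBSE embeddings. Only the two development averages (1.273 and 1.287) are taken from the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rmse(predicted, gold):
    """Root mean squared error between two equal-length score lists."""
    squared_errors = [(p - g) ** 2 for p, g in zip(predicted, gold)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline, system):
    """Fraction by which `system` lowers the error of `baseline`."""
    return (baseline - system) / baseline


# Toy vectors (assumption): real embeddings would come from a sentence encoder.
l1_vector = [1.0, 0.0, 1.0]
english_vector = [1.0, 1.0, 0.0]
similarity_feature = cosine_similarity(l1_vector, english_vector)

# Average dev RMSE values reported in the abstract:
# feature-only XGBoost (1.273) vs. the XLM-RoBERTa-base baseline (1.287).
dev_gain = relative_reduction(baseline=1.287, system=1.273)
print(f"similarity feature: {similarity_feature:.3f}")
print(f"relative dev RMSE reduction: {dev_gain:.1%}")
```

In a feature-based regressor, a value like `similarity_feature` would be one column among the string, frequency, and POS features; how the authors actually compute and combine it is described in the paper itself.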

2025


RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation?
Santiago Góngora | Ignacio Sastre | Santiago Robaina | Ignacio Remersaro | Luis Chiruzzo | Aiala Rosá
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

In this paper, we present the RETUYT-INCO participation in the BEA 2025 shared task. Our participation was characterized by the decision to use relatively small models, with fewer than 1B parameters. This self-imposed restriction is meant to reflect the conditions faced by many research groups and institutions in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive setting, our models remained competitive with the other teams that participated in the shared task. According to the exact F1 scores published by the organizers, our models trailed the winners by 6.46 points in Track 1, 10.24 in Track 2, 7.85 in Track 3, 9.56 in Track 4, and 13.13 in Track 5. Given that the gap to the winning team ranges from 6.46 to 13.13 exact F1 points, we find that models with fewer than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.
