Ignacio Remersaro

2026

RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German
Ignacio Sastre | Ignacio Remersaro | Facundo Díaz | Nicolás De Horta | Luis Chiruzzo | Aiala Rosá | Santiago Góngora
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.

2025

pdf bib abs

RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation?
Santiago Góngora | Ignacio Sastre | Santiago Robaina | Ignacio Remersaro | Luis Chiruzzo | Aiala Rosá
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision of using relatively small models, with fewer than 1B parameters. This self-imposed restriction tries to represent the conditions in which many research groups or institutions are in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive self-imposed setting, our models managed to stay competitive with the rest of teams that participated in the shared task. According to the exact F1 scores published by the organizers, our models had the following distances with respect to the winners: 6.46 in Track 1; 10.24 in Track 2; 7.85 in Track 3; 9.56 in Track 4; and 13.13 in Track 5. Considering that the minimum difference with a winner team is 6.46 points — and the maximum difference is 13.13 — according to the exact F1 score, we find that models with a size smaller than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.

Co-authors

Facundo Díaz 1

Santiago Robaina 1

Venues

BEA2
WS2

Fix author