2025
RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation?
Santiago Góngora | Ignacio Sastre | Santiago Robaina | Ignacio Remersaro | Luis Chiruzzo | Aiala Rosá
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision to use relatively small models, with fewer than 1B parameters. This self-imposed restriction aims to represent the conditions of many research groups and institutions in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive setting, our models managed to stay competitive with the rest of the teams that participated in the shared task. According to the exact F1 scores published by the organizers, our models trailed the winners by the following margins: 6.46 points in Track 1; 10.24 in Track 2; 7.85 in Track 3; 9.56 in Track 4; and 13.13 in Track 5. Given that the gap to the winning team ranges from 6.46 to 13.13 points of exact F1, we find that models smaller than 1B parameters are competitive for these tasks, and all of them can be run on computers with a low-budget GPU or even without a GPU.
Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings
Ignacio Sastre | Aiala Rosá
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)
In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model’s weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.
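The following is a minimal sketch of the memory-token idea as described in the abstract, assuming a Hugging Face causal LM; the model choice, hyperparameters, and names such as memory_embedding are illustrative and may differ from the paper's actual setup.

```python
# Sketch: optimize the embedding of a single "memory token" so that a frozen
# LLM reconstructs a fixed sequence when prompted with that embedding alone.
# Assumptions: a Hugging Face causal LM; "gpt2" stands in for the 100M--8B
# models evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # the model's weights stay frozen
    p.requires_grad_(False)

fixed_sequence = "The quick brown fox jumps over the lazy dog."
target_ids = tokenizer(fixed_sequence, return_tensors="pt").input_ids   # (1, T)
target_embeds = model.get_input_embeddings()(target_ids)                # (1, T, D)

# The only trainable parameter: the embedding of one special memory token.
hidden = model.config.hidden_size
memory_embedding = torch.nn.Parameter(torch.randn(1, 1, hidden) * 0.02)
optimizer = torch.optim.Adam([memory_embedding], lr=1e-3)

for step in range(500):
    optimizer.zero_grad()
    # Prepend the memory token and train it so the model emits the fixed
    # sequence token by token.
    inputs = torch.cat([memory_embedding, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits                          # (1, T+1, V)
    # Output positions 0..T-1 should predict target tokens 0..T-1.
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
    )
    loss.backward()
    optimizer.step()

# At inference time, greedily decoding from the optimized embedding alone
# should reconstruct the fixed sequence exactly.
```

In this sketch only the single memory-token embedding receives gradients, which matches the abstract's claim that the reconstruction is achieved without modifying the model's weights.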
2024
RETUYT-INCO at MLSP 2024: Experiments on Language Simplification using Embeddings, Classifiers and Large Language Models
Ignacio Sastre | Leandro Alfonso | Facundo Fleitas | Federico Gil | Andrés Lucas | Tomás Spoturno | Santiago Góngora | Aiala Rosá | Luis Chiruzzo
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
In this paper we present the participation of the RETUYT-INCO team at the BEA-MLSP 2024 shared task. We followed different approaches, from Multilayer Perceptron models with word embeddings to Large Language Models fine-tuned on different datasets: already existing, crowd-annotated, and synthetic. Our best models are based on fine-tuning Mistral-7B, either with a manually annotated dataset or with synthetic data.
2023
RETUYT-InCo at BEA 2023 Shared Task: Tuning Open-Source LLMs for Generating Teacher Responses
Alexis Baladón | Ignacio Sastre | Luis Chiruzzo | Aiala Rosá
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
This paper presents the results of our participation in the BEA 2023 shared task, which focuses on generating AI teacher responses in educational dialogues. We conducted experiments using several open-source Large Language Models (LLMs) and explored fine-tuning techniques along with prompting strategies, including Few-Shot and Chain-of-Thought approaches. Our best model ranked 4.5 in the competition, with a BertScore F1 of 0.71 and a DialogRPT final (avg) of 0.35. However, our internal results did not exactly correlate with those obtained in the competition, which highlights the difficulty of evaluating this task. Other challenges we faced were data leakage in the train set and the irregular format of the conversations.