Aitor Álvarez

Also published as: Aitor Arronte Álvarez, Aitor Arronte Alvarez


2025

Weakly supervised learning (WSL) is a machine learning approach used when labeled data is scarce or expensive to obtain. In such scenarios, models are trained using weaker supervision sources instead of human-annotated data. However, these sources are often noisy and may introduce unquantified biases during training. This issue is particularly pronounced in automated scoring (AS) of second language (L2) learner output, where high variability and limited generalizability pose significant challenges. In this paper, we investigate analytical scoring of L2 learner responses under weak and semi-supervised learning conditions, leveraging Prediction-Powered Inference (PPI) to provide statistical guarantees on score validity. We compare two approaches: (1) synthetic scoring using large language models (LLMs), and (2) a semi-supervised setting in which a machine learning model, trained on a small gold-standard set, generates predictions for a larger unlabeled corpus. In both cases, PPI is applied to construct valid confidence intervals for assessing the reliability of the predicted scores. Our analysis, based on a dataset of L2 learner conversations with an AI agent, shows that PPI is highly informative for evaluating the quality of weakly annotated data. Moreover, we demonstrate that PPI can increase the effective sample size by over 150% relative to the original human-scored subset, enabling more robust inference in educational assessment settings where labeled data is scarce.
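To illustrate the kind of guarantee described above, here is a minimal sketch of a prediction-powered confidence interval for a mean score, following the standard PPI mean estimator; the variable names and the choice of a normal-approximation interval are illustrative and not taken from the paper.

```python
import numpy as np
from scipy import stats

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, alpha=0.05):
    """Prediction-powered CI for a mean score (all inputs are NumPy arrays).

    y_labeled      : human scores on the small gold-standard set (size n)
    yhat_labeled   : model/LLM predictions on the same gold-standard items
    yhat_unlabeled : model/LLM predictions on the large unlabeled set (size N)
    """
    n, N = len(y_labeled), len(yhat_unlabeled)
    # Rectifier: how biased the predictions are, measured on the gold data
    rectifier = y_labeled - yhat_labeled
    theta_pp = yhat_unlabeled.mean() + rectifier.mean()
    # Standard error combines spread of predictions (over N) and of the rectifier (over n)
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_pp - z * se, theta_pp + z * se
```

Because the unlabeled set is much larger than the gold-standard set, the interval can be substantially tighter than one computed from the human-scored subset alone, which is the sense in which the effective sample size increases.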
This demo paper presents a prototype of a multilingual, speech-based driver assistant designed to support both English and Basque. The inclusion of Basque, a low-resource language with limited domain-specific training data, marks a significant contribution, as publicly available AI models, including Large Language Models, often underperform for such languages compared to high-resource languages like English. Despite these challenges, our system demonstrates robust performance, successfully understanding user queries and delivering rapid responses in a demanding environment: a car simulator. Notably, the system achieves comparable performance in both English and Basque, showcasing its effectiveness in addressing linguistic disparities in AI-driven applications. A demo of our prototype will be available at the workshop.

2024

2016

This paper describes the evaluation methodology followed to measure the impact of using a machine learning algorithm to automatically segment intralingual subtitles. The segmentation quality, productivity and self-reported post-editing effort achieved with this approach are shown to improve on those obtained with the character-counting technique currently most widely used for automatic subtitle segmentation. The corpus used to train and test the proposed automated segmentation method is also described and shared with the community, in order to foster further research in this area.
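For context, the character-counting baseline mentioned above can be sketched roughly as below; the greedy strategy and the character limit are illustrative assumptions, not details taken from the paper.

```python
def segment_by_characters(words, max_chars=37):
    """Greedy baseline: start a new subtitle line whenever adding the next word
    would exceed the character limit (37 is a common limit; the exact value
    here is illustrative)."""
    lines, current = [], ""
    for word in words:
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```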

2015

2014

This paper describes the data collection, annotation and sharing activities carried out within the FP7 EU-funded SAVAS project. The project aims to collect, share and reuse audiovisual language resources from broadcasters and subtitling companies to develop large vocabulary continuous speech recognisers in specific domains and new languages, with the purpose of addressing the automated subtitling needs of the media industry.
Long audio alignment systems for Spanish and English are presented within an automatic subtitling application. Language-specific phone decoders automatically recognize audio contents at phoneme level. At the same time, language-dependent grapheme-to-phoneme modules transcribe the script for the audio. A dynamic programming algorithm (Hirschberg’s algorithm) finds matches between the phonemes automatically recognized by the phone decoder and the phonemes in the script’s transcription. Alignment accuracy is evaluated when scoring alignment operations with a baseline binary matrix and with several continuous-score matrices based on phoneme similarity, as assessed by comparing multivalued phonological features. Alignment accuracy results are reported at phoneme, word and subtitle level. Alignment accuracy when using the continuous scoring matrices based on phonological similarity was clearly higher than when using the baseline binary matrix.
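To make the binary-versus-continuous scoring contrast concrete, here is a minimal sketch of phoneme-sequence alignment with a pluggable scoring function. For brevity it uses plain quadratic-space Needleman-Wunsch dynamic programming rather than the linear-space Hirschberg variant used in the paper, and the Jaccard similarity over phonological feature sets is an illustrative assumption, not the paper's exact metric.

```python
def align(ref, hyp, score, gap=-1.0):
    """Global alignment score of two phoneme sequences.

    ref, hyp : lists of phoneme symbols
    score    : function (a, b) -> similarity, e.g. binary_score or feature_score
    """
    n, m = len(ref), len(hyp)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(ref[i - 1], hyp[j - 1]),  # match/substitution
                           dp[i - 1][j] + gap,                                # deletion
                           dp[i][j - 1] + gap)                                # insertion
    return dp[n][m]

def binary_score(a, b):
    """Baseline: 1 for an exact phoneme match, 0 otherwise."""
    return 1.0 if a == b else 0.0

def feature_score(a, b, features):
    """Continuous score from multivalued phonological features.

    features : dict mapping each phoneme to a set of feature values;
    similarity is the Jaccard overlap of the two feature sets (illustrative).
    """
    fa, fb = features[a], features[b]
    return len(fa & fb) / len(fa | fb)
```

Swapping binary_score for feature_score (with a suitable feature table) changes only the substitution costs, which is where the continuous matrices reward near-matches between phonetically similar phonemes.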