Ainhoa Vivel-Couso

Also published as: Ainhoa Vivel Couso


2026

This paper introduces MeteoGalEus, a multilingual weather dataset that combines meteorological observations from two Spanish regional agencies, Euskalmet and MeteoGalicia. The dataset contains daily records spanning 4 years and 6 months, with aligned observations for both sources. MeteoGalEus captures key meteorological variables including temperature, wind, and sky condition. The dataset is provided in a structured format, facilitating data analysis and integration, with textual forecasts available in the official languages of each region (i.e., Galician and Spanish for MeteoGalicia; Basque and Spanish for Euskalmet). By merging and harmonizing data from the two regional agencies, MeteoGalEus provides a unique resource for cross-regional weather analysis and multilingual climate studies. The dataset is well suited to tasks requiring high-quality, aligned, and standardized weather data across multiple languages and regions. We conducted baseline experiments using LLaMA-based models in both zero-shot and fine-tuned settings to illustrate the use of MeteoGalEus for natural language generation (NLG). Fine-tuning led to consistent improvements across all metrics, with BERTScore increasing from 0.68 to 0.79, ROUGE from 0.20 to 0.35, and BLEU from 0.02 to 0.17 for the best-performing model. The experiments show how MeteoGalEus can serve as a benchmark for multilingual and cross-regional NLG tasks.
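
The abstract reports automatic NLG metrics but not the evaluation code; below is a minimal sketch of how such scores might be computed with the Hugging Face `evaluate` library. The library choice, the ROUGE-L variant, the `lang="es"` setting, and the example sentences are assumptions for illustration, not details from the paper.

```python
import evaluate

# Hypothetical generated and reference forecasts (Spanish here;
# the dataset also covers Galician and Basque).
predictions = ["Cielos despejados con temperaturas suaves y viento flojo."]
references = ["Cielo despejado, temperaturas suaves y viento flojo del norte."]

bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# BERTScore needs a target language (or an explicit multilingual model).
bs = bertscore.compute(predictions=predictions, references=references, lang="es")
print("BERTScore F1:", sum(bs["f1"]) / len(bs["f1"]))

# ROUGE returns aggregated rouge1/rouge2/rougeL scores.
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])

# BLEU accepts one or more references per prediction.
print("BLEU:", bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
```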

2024

This paper presents a study aimed at reproducing and validating a human NLP evaluation originally performed for the DExperts text generation method. The original study introduces DExperts, a controlled text generation method, evaluated using non-toxic prompts from the RealToxicityPrompts dataset. Our reproduction covers the human evaluation of the continuations generated by DExperts and four baseline methods, in terms of toxicity, topicality, and fluency. We first describe the reproduction approach agreed within the ReproHum project and detail the configuration of the original evaluation, including the adaptations necessary for reproduction. We then compare our reproduction results with those reported in the original paper. Interestingly, the human evaluators in our experiment rate the texts generated by DExperts higher, judging them less toxic and more fluent. Overall, the new scores are higher, including for the baseline methods. This study contributes to ongoing efforts to ensure the reproducibility and reliability of findings in NLP evaluation and emphasizes the critical role of robust methodologies in advancing the field.
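
As a concrete illustration of the comparison step, one reproducibility measure used in ReproHum-style analyses is the small-sample corrected coefficient of variation (CV*) over the original and reproduced scores. The sketch below assumes that standard definition and uses hypothetical ratings; the abstract does not state which quantitative analysis the paper actually applies.

```python
import statistics

def cv_star(measurements):
    """Coefficient of variation with a small-sample correction (CV*),
    a measure used in ReproHum-style reproducibility assessments.
    Lower values mean the measurements agree more closely."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)  # sample std. dev. (n - 1 denominator)
    return (1 + 1 / (4 * n)) * (sd / mean) * 100

# Hypothetical mean fluency ratings: original study vs. our reproduction.
original_score, reproduced_score = 3.1, 3.4
print(f"CV*: {cv_star([original_score, reproduced_score]):.2f}")
```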