Marcos Estecha-Garitagoitia


2026

Conversational AI is a central application of NLP, yet ensuring high response quality remains challenging due to the inherently subjective nature of user satisfaction. Dialogue evaluation can be performed manually (through expert or user ratings) or automatically, using methods that aim to predict quality scores consistent with human judgment. In this work, we present a reference-free automatic dialogue evaluation system that predicts user ratings from a dataset of real human–chatbot interactions collected during the Alexa Prize Socialbot Grand Challenge 5, combining multiple complementary models to enhance correlation with human scores. Experimental results indicate that the model achieving the highest Pearson correlation with users’ ratings is an XGBoost regression model that combines features such as conversation length, engineered flags capturing conversation characteristics, predictions from an encoder-based Panel of Experts (PoE), and instruction-based outputs from a fine-tuned LLM. The overall Pearson correlation on the evaluation set is 0.404, which is competitive with prior work trained on an order of magnitude more dialogues, albeit using different datasets and system configurations.
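
The metric reported above, the Pearson correlation between predicted and user-assigned ratings, can be illustrated with a minimal sketch. This is not the authors' evaluation code; the function name `pearson` and the toy ratings are assumptions made purely for illustration.

```python
import math


def pearson(predicted, actual):
    """Pearson correlation between two equal-length lists of numbers.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_a)


# Toy values (invented): model predictions vs. user ratings on a 1-5 scale.
model_scores = [3.1, 4.2, 2.0, 4.8, 3.5]
user_ratings = [3, 5, 1, 4, 4]
print(round(pearson(model_scores, user_ratings), 3))
```

A value near 1 means the predictions rank and scale with the user ratings; a value near 0 means no linear relationship.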
An ongoing challenge in multimodal language research is creating and interpreting dialogues that preserve visual and cultural consistency across turns. We introduce DREAM (Dialogue to REAlistic Multicultural Image Sequences), a multicultural multimodal resource that ties dialogues grounded in explicit persona profiles to photorealistic, storyboard-like image sequences. Each of the 1,000 dialogues includes two rich persona profiles (structured traits plus descriptive language), two matching photorealistic portraits, and a collection of scene-level images depicting key dialogue moments. The pipeline integrates profile augmentation, culturally sensitive prompt engineering, and turn selection to craft cohesive visual narratives and promote character consistency across images, using a controlled generation process that employs large language and image models. Beyond dialogue grounding, DREAM supports appearance-based demographic perception and culture-aware rendering: models can be evaluated on their ability to (i) perceive age, gender presentation, and broad ethnicity appearance clusters from profile portraits, and (ii) maintain these characteristics in dialogue scenes. We provide a unified JSON format integrating profiles, dialogue text, and visual turns, facilitating research on visually anchored dialogue understanding, consistency, and generation. A dual evaluation protocol combines human judgments (realism, coherence, consistency, and demographic perception) with automated portrait analysis via GPT-5. Ethical considerations, limitations, and recommended applications are discussed.

2025

Recent studies suggest that increasing the context window of language models could outperform retrieval-augmented generation (RAG) methods in certain tasks. However, in domains such as art and museums, where information is inherently multimodal, combining images and detailed textual descriptions, this assumption needs closer examination. To explore this, we compare RAG techniques with direct large-context input approaches for answering questions about artworks. Using a dataset of painting images paired with textual information, we develop a synthetic database of question-answer (QA) pairs for evaluating these methods. The focus is on assessing the efficiency and accuracy of RAG in retrieving and using relevant information compared to passing the entire textual context to a language model. Additionally, we experiment with various strategies for segmenting and retrieving text to optimise the RAG pipeline. The results aim to clarify the trade-offs between these approaches and provide valuable insights for interactive systems designed for art and museum contexts.
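
The segment-and-retrieve step of a RAG pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the word-window segmentation, the bag-of-words cosine scoring, and the names `split_into_chunks`, `retrieve`, `chunk_size`, and `overlap` are all assumptions, and a real system would more likely score chunks with dense embeddings.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=50, overlap=10):
    """Split text into word windows of chunk_size words sharing `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]
```

In the large-context alternative, the whole description would be passed to the language model as is; in the RAG setting, only the chunks returned by `retrieve` would be placed in the prompt. Varying `chunk_size` and `overlap` is one simple way to experiment with segmentation strategies.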