Ricardo de Córdoba

Also published as: Ricardo de Cordoba, Ricardo Córdoba, R. Cordoba


2026

Conversational AI is a central application of NLP, yet ensuring high response quality remains challenging due to the inherently subjective nature of user satisfaction. Dialogue evaluation can be performed manually—through expert or user ratings—or automatically, using methods that aim to predict quality scores consistent with human judgment. In this work, we present a reference-free automatic dialogue evaluation system that predicts user ratings from a dataset of real human–chatbot interactions collected during the Alexa Prize Socialbot Grand Challenge 5, combining multiple complementary models to enhance correlation with human scores. Experimental results indicate that the model that achieves the highest Pearson correlation with users’ ratings is an XGBoost regression model that combines different features such as conversation length, engineered flags capturing conversation characteristics, predictions from an Encoder-based Panel of Experts (PoE), and instruction-based outputs from a fine-tuned LLM. The overall Pearson Correlation on the eval set is 0.404, which is competitive with prior work trained on an order of magnitude more dialogues, albeit using different datasets and system configurations.
Industry stakeholders are willing to incorporate AI systems in their pipelines, therefore they want agentic flexibility without losing the guaranties and auditability of fixed pipelines. This paper describes ORCHESTRA, a portable and extensible microservice architecture for orchestrating customizable multimodal AI workflows across domains. It embeds Large Language Model (LLM) agents within a deterministic control flow, combining reliability with adaptive reasoning. A Dockerized Manager routes text, speech, and image requests through specialist workers for ASR, emotion analysis, retrieval, guardrails, and TTS, ensuring that multimodal processing, safety checks, logging, and memory updates are consistently executed, while scoped agent nodes adjust prompts and retrieval strategies dynamically. The system scales via container replication and exposes per-step observability through open-source dashboards. We ground the discussion in a concrete deployment: an interactive museum guide that handles speech and image queries, personalizes narratives with emotion cues, invokes tools, and enforces policy-compliant responses. From this application, we report actionable guidance: interface contracts for services, where to place pre/post safety passes, how to structure memory for RAG, and common failure modes with mitigations. We position the approach against fully agentic and pure pipeline baselines, outline trade-offs (determinism vs. flexibility, latency budget), and sketch near-term extensions such as sharded managers, adaptive sub-flows, and streaming inference. Our goal is to provide a reusable blueprint for safely deploying agent-enhanced, multimodal assistants in production, illustrated through the museums use case.
This paper describes our system for the EEUCA 2026 Shared Task on toxicity classification in gaming chat. We implement a three-stage pipeline combining an ensemble of two compact transformers (DeBERTa-v3-base, 184M; XLM-RoBERTa-base, 278M) with a Linguistically-Informed Mediator (LIM) that resolves inter-model disagreements through corpus-backed lexical normalization, class-conditional unigram scoring, multilingual profanity detection, and agentive targeting analysis grounded in speech act theory. The LIM specifically targets the minority classes (Hate Harassment, Threats, and Extremism), which are the most safety-critical categories in real-world gaming moderation. To address the extreme class imbalance (1,450:1 Non-toxic to Extremism ratio), we introduce a two-stage data augmentation strategy using only the provided training data. Our system achieves a Macro F1 of 0.6441 and accuracy of 0.9062 on the official test set, ranking 3rd in Macro F1 and 1st in accuracy among all teams. The proposed pipeline is domain-portable: adapting to other gaming platforms requires substituting only the game-specific entity lexicon. Code is publicly available at https://github.com/Anmol2059/thaulab_EEUCA.

2025

Recent developments in Multimodal Large Language Models (MLLMs) have provided novel insights into Speech Emotion Recognition (SER). However, combining high-dimensional speech signals with textual tokens can lead to a rapid growth in input tokens, increasing computational costs and inference times. This “token overload” also risks shadowing essential textual cues, affecting the reasoning capabilities of the language model and diluting emotional information crucial to accurate SER. In this paper, we explore different token drop methods that mitigate excessive token counts while preserving both emotional nuances and the core linguistic capabilities of the model. Specifically, we compare various pooling approaches to produce a compact representation. Our preliminary findings suggest that these techniques can reduce computational costs without decreasing SER accuracy.

2009

2004