Edwin Puertas


2026

This paper describes our submission to SemEval 2026 Subtask 1: Longitudinal Affect Assessment, which aims to predict continuous valence and arousal scores from chronologically ordered texts. We implement two regression-based configurations built on DeBERTa fine-tuning: a contextual model and a hybrid model that incorporates normalized lexical features derived from the NRC VAD lexicon. Both systems preserve temporal ordering and apply user-level data splits to ensure generalization to unseen individuals. Results show competitive performance, with stronger outcomes in valence than in arousal. The integration of lexical features does not yield consistent improvements for arousal, highlighting the difficulty of modeling emotional intensity dynamics. Error analysis indicates challenges in handling implicit emotions, pragmatic ambiguity, and subtle affective shifts over time. Overall, findings underscore the importance of combining contextual representations with structured lexical knowledge while addressing longitudinal variability in emotional activation.
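As a rough illustration of what lexicon-derived features for a hybrid model might look like, the sketch below averages per-token valence and arousal scores and reports lexicon coverage. The `TOY_VAD` dictionary, its values, the function name, and the neutral fallback are all assumptions made for illustration; they are not taken from the NRC VAD lexicon or from the submitted system.

```python
from statistics import mean

# Hypothetical stand-in for NRC VAD entries: token -> (valence, arousal) in [0, 1].
# The values are illustrative only and are not copied from the lexicon.
TOY_VAD = {
    "happy": (0.9, 0.7),
    "calm": (0.8, 0.2),
    "angry": (0.1, 0.9),
}


def lexical_features(text, lexicon=TOY_VAD):
    """Return (mean valence, mean arousal, coverage) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        # Assumed fallback: neutral midpoint and zero coverage when no token is in the lexicon.
        return (0.5, 0.5, 0.0)
    valence = mean(v for v, _ in hits)
    arousal = mean(a for _, a in hits)
    coverage = len(hits) / len(tokens)
    return (valence, arousal, coverage)
```

In a hybrid setup, a vector like this would typically be concatenated with the encoder's pooled representation before the regression head; whether the submitted system does exactly that is not stated in the abstract.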
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 4: Narrative Similarity, a shared task on assessing semantic relatedness between short narrative texts. The task comprises two tracks: Track A requires selecting which of two candidate stories is more similar to an anchor, and Track B requires producing fixed-size story embeddings whose cosine similarity reflects narrative relatedness. We propose a unified two-stage system built on Qwen3-Embedding-0.6B. The first stage fine-tunes the encoder as a bi-encoder with a 512-dimensional projection head using a composite loss combining margin ranking, pairwise softmax, and multiple negatives ranking objectives. The second stage trains a lightweight MLP head over frozen bi-encoder embeddings using pairwise interaction features, with k-fold cross-validation and logit-averaging ensemble inference. The system was trained exclusively on the official supervised data, without leveraging the additional 1,900 LLM-generated synthetic triples released by the organizers. Although the system ranked first on both tracks in the development phase, its performance did not transfer to the official test set, where it ranked 47th on Track A and 22nd on Track B.
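The Track A decision can be read as a comparison of two cosine similarities. The following minimal sketch shows that comparison on plain Python lists; the function names are hypothetical, and the vectors stand in for story embeddings that the real system would obtain from its fine-tuned encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_more_similar(anchor, cand_a, cand_b):
    """Return 'A' or 'B' depending on which candidate embedding is closer to the anchor."""
    # Ties are resolved in favor of 'A'; this is an arbitrary choice for the sketch.
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"
```

For Track B the same `cosine` function would be applied directly to the submitted embeddings, so a single encoder can serve both tracks.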
We present a system for rating word sense plausibility in ambiguous narrative contexts for SemEval-2026 Task 5. Our approach ensembles three large language models (Llama-3.1 70B, Qwen-2.5 32B, and Gemma-2 27B) using a computationally efficient, uncertainty-aware pipeline. We combine few-shot chain-of-thought prompting with selective self-consistency, which applies stochastic multiple sampling exclusively to items identified as inherently ambiguous. This targeted strategy reduces inference costs by approximately 45% while maintaining robustness in predictions. To correct the systematic bias of LLMs toward extreme ratings, we apply isotonic regression to shift the output distribution toward patterns of human judgment. Our system achieves a Spearman correlation of 0.67 and an accuracy within 0.76 standard deviations, ranking 34th out of 79 participating teams (top 43% without task-specific fine-tuning). Detailed error analysis reveals that while our system performs strongly on clear contexts (ρ = 0.78), current prompting paradigms struggle significantly to model multimodal human disagreement in genuinely ambiguous cases (ρ = 0.58), highlighting an important challenge for future work on subjective semantic tasks.
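To make the calibration step more concrete, here is a small sketch of isotonic regression using the pool-adjacent-violators algorithm. It assumes the human ratings have already been sorted by the corresponding raw model score; the function name and the example values are illustrative and do not come from the system described above.

```python
def pool_adjacent_violators(targets):
    """Least-squares non-decreasing fit to `targets`.

    `targets` are assumed to be human ratings ordered by increasing raw model score.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in targets:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted
```

The fitted values define a monotone mapping from raw scores to calibrated ratings. Because adjacent violators are averaged together, the mapping tends to pull isolated extreme outputs toward the values human annotators actually gave.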
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 6: CLARITY, a shared task on automatic detection of question evasion in political interview transcripts. The task requires classifying question-answer pairs into three clarity levels (Task 1) and identifying nine evasion techniques (Task 2). We propose and evaluate two independent systems based on RoBERTa-Large. The first is a standard sequence classifier that treats each question-answer pair as a single input sequence, leveraging RoBERTa’s native two-segment encoding to model the relationship between the two texts jointly. The second is a dual-encoder architecture that processes the question and answer independently and computes geometric interaction features to model the semantic misalignment between them explicitly. Both systems are trained on Task 2 labels and derive Task 1 predictions via the hierarchical mapping proposed by the task organizers. Our best result was achieved by the standard sequence classifier, reaching Rank 10 on Task 2 and Rank 25 on Task 1.
In multilingual and multicultural contexts, LLMs require contextualization mechanisms to generate culturally coherent responses. In this sense, this study presents a LLaMA-based approach to answer short cultural questions in different languages within Task 7 of SemEval-2026 (Track 1: SAQ), without access to official training data. The system integrates controlled synthetic data generation, evidence retrieval through web snippets, and a Retrieval-Augmented Generation (RAG) framework with few-shot learning. BLEnD is used solely as a thematic guide, ensuring semantic independence. During development, the LLaMA-3.1-8B model achieved 38.51% global accuracy, while LLaMA-3.2-1B obtained 15.54%. In large-scale evaluation (30,500 instances), the 1B model achieved 16.69%, maintaining stability after prompt optimization. The results demonstrate that contextual retrieval improves multilingual cultural knowledge evaluation and highlight the importance of pipeline design and model capacity.
This work addresses the temporal ordering task of clinical frames in the Basic Life Support (BLS) subset of ClinSkillQA. A two-stage hybrid pipeline based on Qwen2-VL-2B-Instruct in a zero-shot configuration is proposed. In Stage 1, each image is processed independently to extract factual visual evidence, which is then transformed, using deterministic rules, into a structured representation. In Stage 2, ordering is formulated as an ordinal scoring task over procedural stages, with ties broken using PCA applied to multimodal embeddings. Evaluation followed the official benchmark protocol, considering Task Accuracy, Pairwise Accuracy, and BERTScore. In the test phase, the system achieved Task Accuracy = 0.17, Pairwise Micro Accuracy = 0.60, and BERT F1 = 0.71, with complete coverage in both predictions and rationales. The results demonstrate an interpretable and reproducible foundation, although challenges in fine-grained temporal discrimination remain.
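The gap between Task Accuracy and Pairwise Accuracy is easier to see with a concrete pairwise metric. The sketch below counts how many frame pairs are placed in the correct relative order. It is one plausible reading of a pairwise ordering score, not the official benchmark implementation, and the function name and frame identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_accuracy(predicted, gold):
    """Fraction of item pairs whose relative order in `predicted` matches `gold`.

    Both arguments are sequences containing the same unique items (at least two).
    """
    position = {item: index for index, item in enumerate(predicted)}
    # combinations() preserves input order, so each pair (a, b) has a before b in gold.
    pairs = list(combinations(gold, 2))
    correct = sum(position[a] < position[b] for a, b in pairs)
    return correct / len(pairs)
```

Under this definition, a sequence with a single adjacent swap still scores high on pairs while counting as a complete miss on exact-match accuracy, which is consistent with the pattern of scores reported above.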

2025

The first approach leverages advanced LLMs, employing a chain-of-thought prompting strategy with one-shot learning and Google snippets for context retrieval, demonstrating superior performance. The second approach utilizes traditional NLP analysis techniques, including semantic ranking, token-level extraction, and rigorous data cleaning, to identify hallucinations.
This paper presents the VerbaNexAi Lab system for SemEval-2025 Task 2: Entity-Aware Machine Translation (EA-MT), focusing on translating named entities from English to Spanish across categories such as musical works, foods, and landmarks. Our approach integrates detailed data preprocessing, enrichment with 240,432 Wikidata entity pairs, and fine-tuning of the MarianMT model to enhance entity translation accuracy. Official results reveal a COMET score of 87.09, indicating high fluency, an M-ETA score of 24.62, highlighting challenges in entity precision, and an Overall Score of 38.38, ranking last among 34 systems. While Wikidata improved translations for common entities like “Águila de San Juan,” our static methodology underperformed compared to dynamic LLM-based approaches.
Emotion intensity prediction plays a crucial role in affective computing, allowing for a more precise understanding of how emotions are conveyed in text. This study proposes a system that estimates emotion intensity levels by integrating contextual language representations with numerical emotion-based features derived from Valence, Arousal, and Dominance (VAD). The methodology combines BERT embeddings, predefined VAD values per emotion, and machine learning techniques to enhance emotion detection, without relying on external lexicons. The system was evaluated on the SemEval-2025 Task 11 Track B dataset, predicting five emotions (anger, fear, joy, sadness, and surprise) on an ordinal scale. The results highlight the effectiveness of integrating contextual representations with predefined VAD values, enabling a more nuanced representation of emotional intensity. However, challenges arose in distinguishing intermediate intensity levels, affecting classification accuracy for certain emotions. Despite these limitations, the study provides insights into the strengths and weaknesses of combining deep learning with numerical emotion modeling, contributing to the development of more robust emotion prediction systems. Future research will explore advanced architectures and additional linguistic features to enhance model generalization across diverse textual domains.
Emotion detection in text has become a highly relevant research area due to the growing interest in understanding emotional states from human interaction in the digital world. This study presents an approach for emotion detection in text using a RoBERTa-based model, optimized for multi-label classification of the emotions joy, sadness, fear, anger, and surprise in the context of the SemEval 2025 - Task 11: Bridging the Gap in Text-Based Emotion Detection competition. Advanced preprocessing strategies were incorporated, including the augmentation of the training dataset through automatic translation to improve the representativeness of less frequent emotions. Additionally, a loss function adjustment mechanism was implemented to mitigate class imbalance, enabling the model to enhance its detection capability for underrepresented categories. The experimental results reflect competitive performance, with a macro F1 of 0.6577 on the development set and 0.6266 on the test set. In the competition, the model ranked 47th, demonstrating solid performance against the challenge posed.
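The abstract does not say how the loss function was adjusted. One common option for multi-label imbalance is to weight the positive term of each label's binary cross-entropy by the ratio of negative to positive examples. The sketch below computes such weights from a binary label matrix; the function name and the fallback value are assumptions, and this is not claimed to be the mechanism the authors used.

```python
def positive_class_weights(label_rows):
    """Per-label weight = (#negatives / #positives), computed over a binary label matrix.

    `label_rows` is a list of equal-length 0/1 lists, one row per training example.
    """
    num_examples = len(label_rows)
    num_labels = len(label_rows[0])
    weights = []
    for j in range(num_labels):
        positives = sum(row[j] for row in label_rows)
        negatives = num_examples - positives
        # Assumed fallback: a neutral weight of 1.0 when a label has no positive example.
        weights.append(negatives / positives if positives else 1.0)
    return weights
```

Rare labels receive weights above 1 and frequent labels receive weights below 1, so errors on underrepresented emotions contribute more to the training loss.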
Ensuring food safety requires effective detection of potential hazards in food products. This paper presents the participation of VerbaNexAI in the SemEval-2025 Task 9 challenge, which focuses on the automatic identification and classification of food hazards from descriptive texts. Our approach employs a machine learning-based strategy, leveraging a Random Forest classifier combined with TF-IDF vectorization and character n-grams (n=2-5) to enhance linguistic pattern recognition. The system achieved competitive performance in hazard and product classification tasks, obtaining notable macro and micro F1 scores. However, we identified challenges such as handling underrepresented categories and improving generalization in multilingual contexts. Our findings highlight the need to refine preprocessing techniques and model architectures to enhance food hazard detection. We made the source code publicly available to encourage reproducibility and collaboration in future research.
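For readers unfamiliar with character n-grams, the following sketch enumerates all substrings of length 2 to 5 in a text and counts them. It only illustrates the feature extraction idea: the function name is hypothetical, and the actual system's vectorizer settings (for example, word-boundary handling or TF-IDF weighting details) are not specified in the abstract.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count every character n-gram of length n_min..n_max in `text`."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # The range is empty when the text is shorter than n, so short inputs are safe.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts
```

Counts like these would then be reweighted with TF-IDF and passed to the classifier. Character-level features are often robust to spelling variation and product-name fragments, which is one plausible reason to prefer them for short hazard reports.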

2024

This paper presents an artificial intelligence model designed to detect semantic relationships in natural language, addressing the challenges of SemEval 2024 Task 1. Our goal is to advance machine understanding of the subtleties of human language through semantic analysis. Using a novel combination of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and an attention mechanism, our model is trained on the STR-2022 dataset. This approach enhances its ability to detect semantic nuances in different texts. The model achieved an 81.92% effectiveness rate and ranked 24th in SemEval 2024 Task 1. These results demonstrate its robustness and adaptability in detecting semantic relationships and validate its performance in diverse linguistic contexts. Our work contributes to natural language processing by providing insights into semantic textual relatedness. It sets a benchmark for future research and promises to inspire innovations that could transform digital language processing and interaction.
This study delineates our participation in the SemEval-2024 Task 3: Multimodal Emotion Cause Analysis in Conversations, focusing on developing and applying an innovative methodology for emotion detection and cause analysis in conversational contexts. Leveraging logistic regression, we analyzed conversational utterances to identify emotions per utterance. Subsequently, we employed a dependency analysis pipeline, utilizing SpaCy to extract significant chunk features, including object, subject, adjectival modifiers, and adverbial clause modifiers. These features were analyzed within a graph-like framework, conceptualizing the dependency relationships as edges connecting emotional causes (tails) to their corresponding emotions (heads). Despite the novelty of our approach, the preliminary results were unexpectedly humbling, with a consistent score of 0.0 across all evaluated metrics. This paper presents our methodology, the challenges encountered, and an analysis of the potential factors contributing to these outcomes, offering insights into the complexities of emotion-cause analysis in multimodal conversational data.
This study introduces an innovative approach to emotion recognition and reasoning about emotional shifts in code-mixed conversations, leveraging the NRC VAD Lexicon and computational models such as Transformer and GRU. Our methodology systematically identifies and categorizes emotional triggers, employing Emotion Flip Reasoning (EFR) and Emotion Recognition in Conversation (ERC). Through experiments with the MELD and MaSaC datasets, we demonstrate the model’s precision in accurately identifying emotional shift triggers and classifying emotions, evidenced by a significant improvement in accuracy as shown by an increase in the F1 score when including VAD analysis. These results underscore the importance of incorporating complex emotional dimensions into conversation analysis, paving new pathways for understanding emotional dynamics in code-mixed texts.
The automatic identification of medical errors in clinical notes is crucial for improving the quality of healthcare services. LLMs emerge as a powerful artificial intelligence tool for automating this task. However, LLMs present vulnerabilities, high costs, and sometimes a lack of transparency. This article addresses the detection of medical errors through the fine-tuning approach, conducting a comprehensive comparison between various models and exploring in depth the components of the machine learning pipeline. The results obtained with the fine-tuned ClinicalBert and gated recurrent unit (GRU) models show an accuracy of 0.56 and 0.55, respectively. This approach not only mitigates the problems associated with the use of LLMs but also demonstrates how exhaustive iteration in critical phases of the pipeline, especially in feature selection, can facilitate the automation of clinical record analysis.

2023

Nowadays, persuasive messages are increasingly frequent in social networks, which generates great concern in several communities, given that persuasion seeks to guide others towards the adoption of ideas, attitudes or actions that they consider to be beneficial to themselves. The efficient detection of news genre categories, framing, and persuasion techniques requires several scientific disciplines, such as computational linguistics and sociology. Here we illustrate how we use lexical features to determine, given a news article, whether it is an opinion piece, aims to report factual news, or is satire. This paper presents a novel strategy for news based on Lexical Weirdness. The results are part of our participation in subtasks 1 and 2 in SemEval 2023 Task 3.